Nikhil Mali, James Peng, Timothy Welsh, Steven Zhang
Python package versions:
Pillow: 9.5.0
WordCloud: 1.9.1.1
scikit-learn: 1.2.2
pandas: 2.0.1
matplotlib: 3.5.0
seaborn: 0.11.2
numpy: 1.21.5
google-api-python-client: 2.86.0
google-auth: 2.18.0
google-auth-oauthlib: 1.0.0
requests: 2.28.0
circlify: 0.15.0
Preliminary Graphs
Top 50 Videos
Word Cloud
Circle Packing Chart
More Data Collection: Profile Picture Extraction from Youtube API
More Data Cleanup: Cropping Images
Graphing
Classification
Analysis: What Makes a Video Trending?
Comparing Likes, Viewcount, and Comments
This project examines the videos that make it onto the YouTube trending page. YouTube is the most popular video-sharing website and the second most visited website in the world, and it has been a driving force in pop culture. It is worth examining because what is popular on YouTube can reach large audiences across the world, and its messages may reflect popular opinion, possibly even more effectively than governments do.
Thus, our project investigates several questions about YouTube's trending page: what does it take to make it onto the trending list, is there a relationship between views and likes on trending videos, and which channels appear on trending most often? Throughout, we use several techniques for visualizing and analyzing this information, including scatterplots, word clouds, bubble diagrams, and classification methods.
We found a dataset on Kaggle that contains statistics on trending YouTube videos. We examined the fields it provides, including video tags, channel names, categories, view count, likes, dislikes (before they were removed), comment count, and publish/trending dates. We also retrieved channel profile pictures from the YouTube API and projected them onto our bubble plot to visualize which channels appear often in trending. Overall, we provide a comprehensive breakdown of this dataset and the trends in it.
The data set contains trending video data from as far back as August 12, 2020, and it is still being updated today. We downloaded it on April 30. It contains data for 11 different countries, but we will only use US trending data. Here is what it looks like:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
data = pd.read_csv("US_youtube_trending_data.csv")
print(data.shape)
data.head()
(199190, 16)
| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3C66w5Z0ixs | I ASKED HER TO BE MY GIRLFRIEND... | 2020-08-11T19:20:14Z | UCvtRTOMP2TqYqu51xNrqAzg | Brawadis | 22 | 2020-08-12T00:00:00Z | brawadis|prank|basketball|skits|ghost|funny vi... | 1514614 | 156908 | 5855 | 35313 | https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg | False | False | SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib... |
| 1 | M9Pmf9AB4Mo | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | UC0ZV6M2THA81QT9hrVWJG3A | Apex Legends | 20 | 2020-08-12T00:00:00Z | Apex Legends|Apex Legends characters|new Apex ... | 2381688 | 146739 | 2794 | 16549 | https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg | False | False | While running her own modding shop, Ramya Pare... |
| 2 | J78aPJ3VyNs | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | UCYzPXprvl5Y-Sf0g4vX-m6g | jacksepticeye | 24 | 2020-08-12T00:00:00Z | jacksepticeye|funny|funny meme|memes|jacksepti... | 2038853 | 353787 | 2628 | 40221 | https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg | False | False | I left youtube for a month and this is what ha... |
| 3 | kXLn3HkpjaA | XXL 2020 Freshman Class Revealed - Official An... | 2020-08-11T16:38:55Z | UCbg_UMjlHJg_19SZckaKajg | XXL | 10 | 2020-08-12T00:00:00Z | xxl freshman|xxl freshmen|2020 xxl freshman|20... | 496771 | 23251 | 1856 | 7647 | https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg | False | False | Subscribe to XXL → http://bit.ly/subscribe-xxl... |
| 4 | VIUo6yapDbc | Ultimate DIY Home Movie Theater for The LaBran... | 2020-08-11T15:10:05Z | UCDVPcEbVLQgLZX0Rt6jo34A | Mr. Kate | 26 | 2020-08-12T00:00:00Z | The LaBrant Family|DIY|Interior Design|Makeove... | 1123889 | 45802 | 964 | 2196 | https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg | False | False | Transforming The LaBrant Family's empty white ... |
The US data also comes with a json file containing category ID information:
import json
# Use a context manager so the file is closed after loading
with open("US_category_id.json") as f:
    category_info = json.load(f)
print(category_info.keys())
category_info['items'][:5]
dict_keys(['kind', 'etag', 'items'])
[{'kind': 'youtube#videoCategory',
'etag': 'IfWa37JGcqZs-jZeAyFGkbeh6bc',
'id': '1',
'snippet': {'title': 'Film & Animation',
'assignable': True,
'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
{'kind': 'youtube#videoCategory',
'etag': '5XGylIs7zkjHh5940dsT5862m1Y',
'id': '2',
'snippet': {'title': 'Autos & Vehicles',
'assignable': True,
'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
{'kind': 'youtube#videoCategory',
'etag': 'HCjFMARbBeWjpm6PDfReCOMOZGA',
'id': '10',
'snippet': {'title': 'Music',
'assignable': True,
'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
{'kind': 'youtube#videoCategory',
'etag': 'ra8H7xyAfmE2FewsDabE3TUSq10',
'id': '15',
'snippet': {'title': 'Pets & Animals',
'assignable': True,
'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}},
{'kind': 'youtube#videoCategory',
'etag': '7mqChSJogdF3hSIL-88BfDE-W8M',
'id': '17',
'snippet': {'title': 'Sports',
'assignable': True,
'channelId': 'UCBR8-60-B28hp2BmDPdntcQ'}}]
We don't need most of this info; we really just want to connect each category ID with its category title. Let's extract those two fields and put them in a dictionary.
categories = {}
for entry in category_info["items"]:
    categories[entry["id"]] = entry["snippet"]["title"]
print(categories)
{'1': 'Film & Animation', '2': 'Autos & Vehicles', '10': 'Music', '15': 'Pets & Animals', '17': 'Sports', '18': 'Short Movies', '19': 'Travel & Events', '20': 'Gaming', '21': 'Videoblogging', '22': 'People & Blogs', '23': 'Comedy', '24': 'Entertainment', '25': 'News & Politics', '26': 'Howto & Style', '27': 'Education', '28': 'Science & Technology', '29': 'Nonprofits & Activism', '30': 'Movies', '31': 'Anime/Animation', '32': 'Action/Adventure', '33': 'Classics', '34': 'Comedy', '35': 'Documentary', '36': 'Drama', '37': 'Family', '38': 'Foreign', '39': 'Horror', '40': 'Sci-Fi/Fantasy', '41': 'Thriller', '42': 'Shorts', '43': 'Shows', '44': 'Trailers'}
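The same lookup can also be built in a single expression. Here is a minimal sketch, assuming a payload shaped like the `items` list printed above; `sample_info` and `sample_categories` are hypothetical names, and the two entries are a stand-in for the real file:

```python
# Hypothetical stand-in for the parsed US_category_id.json payload
sample_info = {
    "items": [
        {"id": "1", "snippet": {"title": "Film & Animation"}},
        {"id": "10", "snippet": {"title": "Music"}},
    ]
}

# Map each category ID (stored as a string in the JSON) to its title
sample_categories = {
    entry["id"]: entry["snippet"]["title"] for entry in sample_info["items"]
}
print(sample_categories)
```

Note that the IDs are strings here, while the `categoryId` column in the CSV holds integers, so one side has to be converted before the two can be matched.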
For further ease of access, let's create a new column with the category titles.
data["category_title"] = data["categoryId"].map(lambda ID: categories[str(ID)])
data.head(5)
| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3C66w5Z0ixs | I ASKED HER TO BE MY GIRLFRIEND... | 2020-08-11T19:20:14Z | UCvtRTOMP2TqYqu51xNrqAzg | Brawadis | 22 | 2020-08-12T00:00:00Z | brawadis|prank|basketball|skits|ghost|funny vi... | 1514614 | 156908 | 5855 | 35313 | https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg | False | False | SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib... | People & Blogs |
| 1 | M9Pmf9AB4Mo | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | UC0ZV6M2THA81QT9hrVWJG3A | Apex Legends | 20 | 2020-08-12T00:00:00Z | Apex Legends|Apex Legends characters|new Apex ... | 2381688 | 146739 | 2794 | 16549 | https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg | False | False | While running her own modding shop, Ramya Pare... | Gaming |
| 2 | J78aPJ3VyNs | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | UCYzPXprvl5Y-Sf0g4vX-m6g | jacksepticeye | 24 | 2020-08-12T00:00:00Z | jacksepticeye|funny|funny meme|memes|jacksepti... | 2038853 | 353787 | 2628 | 40221 | https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg | False | False | I left youtube for a month and this is what ha... | Entertainment |
| 3 | kXLn3HkpjaA | XXL 2020 Freshman Class Revealed - Official An... | 2020-08-11T16:38:55Z | UCbg_UMjlHJg_19SZckaKajg | XXL | 10 | 2020-08-12T00:00:00Z | xxl freshman|xxl freshmen|2020 xxl freshman|20... | 496771 | 23251 | 1856 | 7647 | https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg | False | False | Subscribe to XXL → http://bit.ly/subscribe-xxl... | Music |
| 4 | VIUo6yapDbc | Ultimate DIY Home Movie Theater for The LaBran... | 2020-08-11T15:10:05Z | UCDVPcEbVLQgLZX0Rt6jo34A | Mr. Kate | 26 | 2020-08-12T00:00:00Z | The LaBrant Family|DIY|Interior Design|Makeove... | 1123889 | 45802 | 964 | 2196 | https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg | False | False | Transforming The LaBrant Family's empty white ... | Howto & Style |
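One thing to keep in mind with the lambda approach is that it raises a `KeyError` if an ID is missing from the dictionary. As an alternative sketch (not what we ran above), `Series.map` also accepts a dictionary directly and fills unknown IDs with `NaN`. The toy frame and the ID 99 below are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: category IDs are integers in the CSV
toy = pd.DataFrame({"categoryId": [22, 20, 99]})

# Keys are strings in the JSON file, so convert them to int once
toy_categories = {"22": "People & Blogs", "20": "Gaming"}
int_categories = {int(k): v for k, v in toy_categories.items()}

# Series.map accepts a dict; IDs missing from the dict become NaN
toy["category_title"] = toy["categoryId"].map(int_categories)
print(toy)
```

Whether a silent `NaN` or a loud `KeyError` is preferable depends on whether unknown IDs are expected in the data.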
Let's check the dtypes so we can convert the columns to more specific types.
data.dtypes
video_id             object
title                object
publishedAt          object
channelId            object
channelTitle         object
categoryId            int64
trending_date        object
tags                 object
view_count            int64
likes                 int64
dislikes              int64
comment_count         int64
thumbnail_link       object
comments_disabled      bool
ratings_disabled       bool
description          object
category_title       object
dtype: object
The two date columns are stored as generic objects, so let's convert them to datetimes. Then we'll look at the top videos by view count to find out more.
data["publishedAt"] = pd.to_datetime(data["publishedAt"])
data["trending_date"] = pd.to_datetime(data["trending_date"])
print(data["publishedAt"].dtype)
print(data["trending_date"].dtype)
datetime64[ns, UTC]
datetime64[ns, UTC]
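To see why the conversion is useful, here is a small sketch on two timestamps copied from the first row of the table above. Once the strings are parsed, we can do arithmetic on them, such as measuring the gap between publishing and trending. The names `stamps`, `parsed`, and `delay` are just for this example:

```python
import pandas as pd

# publishedAt and trending_date of the first row shown earlier
stamps = pd.Series(["2020-08-11T19:20:14Z", "2020-08-12T00:00:00Z"])
parsed = pd.to_datetime(stamps)

# Timestamps ending in "Z" are parsed as timezone-aware UTC datetimes
print(parsed.dtype)

# Datetime values support subtraction, which plain strings do not
delay = parsed[1] - parsed[0]
print(delay)
```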
sort = data.sort_values("view_count", ascending = False)
sort.head(10)
| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152788 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-10 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 277791741 | 12993894 | 0 | 3534337 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 152568 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-09 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 273162966 | 12937252 | 0 | 3516745 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 152365 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-08 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 268758295 | 12882841 | 0 | 3504692 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 152175 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-07 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 264459017 | 12829059 | 0 | 3491132 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 56374 | WMweEpGlu_U | BTS (방탄소년단) 'Butter' Official MV | 2021-05-21 03:46:13+00:00 | UC3IZKseVpdzPSBaWxBxundA | HYBE LABELS | 10 | 2021-05-30 00:00:00+00:00 | BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄 | 264407389 | 16021534 | 150989 | 6738537 | https://i.ytimg.com/vi/WMweEpGlu_U/default.jpg | False | False | BTS (방탄소년단) 'Butter' Official MV Credits: Dire... | Music |
| 151968 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-06 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 260126694 | 12773474 | 0 | 3479717 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 151773 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-05 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 255524865 | 12715882 | 0 | 3466961 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 151570 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-04 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 250963177 | 12653408 | 0 | 3450920 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 151372 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-03 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 245994768 | 12577694 | 0 | 3438048 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 151167 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-02 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 240757307 | 12502699 | 0 | 3424014 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
Because this dataset is collected by scanning the trending tab every day, a video that trends for several days gets one entry per day, so the same ID/title can have multiple entries.
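A quick way to measure this kind of repetition is to count rows per video ID. The sketch below uses a made-up toy frame with the same two column names as our data; the IDs and dates are invented:

```python
import pandas as pd

# Hypothetical toy snapshot data: one row per video per trending day
toy = pd.DataFrame({
    "video_id": ["a", "a", "a", "b", "c", "c"],
    "trending_date": pd.to_datetime([
        "2022-09-01", "2022-09-02", "2022-09-03",
        "2022-09-01", "2022-09-02", "2022-09-03",
    ]),
})

# Number of daily snapshots in which each video appears
days_trending = toy["video_id"].value_counts()
print(days_trending)

# Total rows versus distinct videos
print(len(toy), toy["video_id"].nunique())
```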
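# Note: sort_values returns a new DataFrame; the result is not assigned here,
# so data keeps its original order in the output below
data.sort_values("trending_date", ascending = False)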
data.head()
| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3C66w5Z0ixs | I ASKED HER TO BE MY GIRLFRIEND... | 2020-08-11 19:20:14+00:00 | UCvtRTOMP2TqYqu51xNrqAzg | Brawadis | 22 | 2020-08-12 00:00:00+00:00 | brawadis|prank|basketball|skits|ghost|funny vi... | 1514614 | 156908 | 5855 | 35313 | https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg | False | False | SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib... | People & Blogs |
| 1 | M9Pmf9AB4Mo | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11 17:00:10+00:00 | UC0ZV6M2THA81QT9hrVWJG3A | Apex Legends | 20 | 2020-08-12 00:00:00+00:00 | Apex Legends|Apex Legends characters|new Apex ... | 2381688 | 146739 | 2794 | 16549 | https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg | False | False | While running her own modding shop, Ramya Pare... | Gaming |
| 2 | J78aPJ3VyNs | I left youtube for a month and THIS is what ha... | 2020-08-11 16:34:06+00:00 | UCYzPXprvl5Y-Sf0g4vX-m6g | jacksepticeye | 24 | 2020-08-12 00:00:00+00:00 | jacksepticeye|funny|funny meme|memes|jacksepti... | 2038853 | 353787 | 2628 | 40221 | https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg | False | False | I left youtube for a month and this is what ha... | Entertainment |
| 3 | kXLn3HkpjaA | XXL 2020 Freshman Class Revealed - Official An... | 2020-08-11 16:38:55+00:00 | UCbg_UMjlHJg_19SZckaKajg | XXL | 10 | 2020-08-12 00:00:00+00:00 | xxl freshman|xxl freshmen|2020 xxl freshman|20... | 496771 | 23251 | 1856 | 7647 | https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg | False | False | Subscribe to XXL → http://bit.ly/subscribe-xxl... | Music |
| 4 | VIUo6yapDbc | Ultimate DIY Home Movie Theater for The LaBran... | 2020-08-11 15:10:05+00:00 | UCDVPcEbVLQgLZX0Rt6jo34A | Mr. Kate | 26 | 2020-08-12 00:00:00+00:00 | The LaBrant Family|DIY|Interior Design|Makeove... | 1123889 | 45802 | 964 | 2196 | https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg | False | False | Transforming The LaBrant Family's empty white ... | Howto & Style |
sort.head(25)["tags"]
152788    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
152568    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
152365    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
152175    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
56374     BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄
151968    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
151773    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
151570    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
151372    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
151167    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
150968    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
3358      BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄
150758    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
150544    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
3137      BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄
150335    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
150128    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
2894      BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄
149917    YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블...
73564     [None]
73361     [None]
73162     [None]
72959     [None]
72751     [None]
2653      BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄
Name: tags, dtype: object
It looks like some videos have no tags! Although that's not a big deal, it's good to keep in mind when we use them.
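If we want to work with individual tags later, the untagged rows need special handling. This is a rough sketch, assuming (as the output above suggests) that missing tags are stored as the literal string `[None]` and that tags are separated by `|`. The toy rows and the `tag_list` column are hypothetical:

```python
import pandas as pd

# Toy rows modeled on the tags column shown above
toy = pd.DataFrame({"tags": ["brawadis|prank|funny", "[None]", "NFL|Football"]})

# Flag rows whose tags field is the "[None]" placeholder
no_tags = toy["tags"] == "[None]"
print(no_tags.sum(), "of", len(toy), "rows have no tags")

# Split the pipe-separated tags into lists, using an empty list when untagged
toy["tag_list"] = [
    [] if missing else tags.split("|")
    for tags, missing in zip(toy["tags"], no_tags)
]
print(toy["tag_list"].tolist())
```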
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(data["publishedAt"]), np.array(data["view_count"]))
plt.ylabel("Viewcount")
plt.xlabel("Time")
plt.title("Viewcount over time")
plt.show()
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(data["view_count"]), np.array(data["likes"]))
plt.xlabel("Viewcount")
plt.ylabel("Likes")
plt.title("Viewcount and Likes")
plt.show()
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(data["view_count"]), np.array(data["dislikes"]))
plt.xlabel("Viewcount")
plt.ylabel("Dislikes")
plt.title("Viewcount and Dislikes")
plt.show()
As the tables above show, the recent video entries have 0 dislikes, because YouTube stopped showing public dislike counts in November 2021. For this reason, many entries in the dislikes graph sit at the very bottom, at 0 dislikes.
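One way to check this is to compare the share of zero-dislike rows before and after the change. The sketch below uses toy values loosely modeled on rows shown earlier, and the cutoff date is an assumption chosen for illustration; the exact date at which zeros begin in the dataset may differ:

```python
import pandas as pd

# Hypothetical toy rows: dislikes are reported early on and are 0 later
toy = pd.DataFrame({
    "trending_date": pd.to_datetime(
        ["2021-06-21", "2021-10-16", "2022-02-22", "2022-09-10"], utc=True
    ),
    "dislikes": [677, 206, 0, 0],
})

# Assumed cutoff for illustration only
cutoff = pd.Timestamp("2021-12-01", tz="UTC")
toy["period"] = (toy["trending_date"] >= cutoff).map({False: "before", True: "after"})

# Fraction of rows with zero dislikes in each period
zero_share = toy["dislikes"].eq(0).astype(int).groupby(toy["period"]).mean()
print(zero_share)
```

On the real data, rows after the cutoff could be excluded from any analysis of dislikes so that the zeros do not distort it.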
Some areas seem to have an abnormally high number of data points lined up in a linear relationship. Why is that? Here is our hypothesis: since the rows are not unique, one video can appear multiple times. Thus, the abnormal lines in the data come from one video gaining many views/likes/dislikes over a short period of time. Furthermore, a video's views/likes/dislikes keep growing but tend to taper off before the video finally stops trending, which explains the concentration of points at the tips of the lines, and also why the lines seem to move toward higher views/likes/dislikes.
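The hypothesis can be illustrated by following a single video through its daily snapshots. This is a sketch on invented numbers, not our actual data; `track` and `new_views` are hypothetical names:

```python
import pandas as pd

# Hypothetical snapshots: video "x" trends for three days, video "y" for one
toy = pd.DataFrame({
    "video_id": ["x", "x", "x", "y"],
    "trending_date": pd.to_datetime(
        ["2022-09-03", "2022-09-01", "2022-09-02", "2022-09-01"]
    ),
    "view_count": [300, 100, 250, 40],
    "likes": [30, 10, 25, 4],
})

# Pull out one video's snapshots in chronological order
track = toy[toy["video_id"] == "x"].sort_values("trending_date")

# Day-over-day change in views; shrinking values mean growth is tapering off
track = track.assign(new_views=track["view_count"].diff())
print(track[["trending_date", "view_count", "likes", "new_views"]])
```

Plotted as views against likes, these three rows would fall along a line, with the later points bunched closer together.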
grouped = data.groupby(["title"])
print(data.shape)
grouped.ngroups
(199190, 17)
37251
The dataset has 199,190 rows and originally had 16 columns (17 now that we have added category_title). Grouping by title, we find 37,251 unique titles, which we treat as unique videos.
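Grouping by title and grouping by video ID are not guaranteed to give the same count, since two different videos can share a title. A small sketch on made-up rows shows the difference:

```python
import pandas as pd

# Hypothetical rows: videos "a" and "b" are different uploads with the same title
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "c"],
    "title": ["Highlights", "Highlights", "Highlights", "Trailer"],
})

print(toy["title"].nunique())     # distinct titles
print(toy["video_id"].nunique())  # distinct video IDs
```

Comparing the two numbers on the real data would show how much the choice of key matters.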
# with assistance from https://stackoverflow.com/questions/53842287/select-rows-with-highest-value-from-groupby
unique=data.loc[grouped["view_count"].idxmax()]
unique
| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29372 | 3BgG4bUDHa4 | !@#$%$#!! || Dubov vs Carlsen || Airthings Mas... | 2020-12-30 17:43:29+00:00 | UCL5YbN5WLFD8dLIegT5QAbA | agadmator's Chess Channel | 24 | 2021-01-09 00:00:00+00:00 | agadmator|chess|best chess channel|best youtub... | 609588 | 25234 | 283 | 2186 | https://i.ytimg.com/vi/3BgG4bUDHa4/default.jpg | False | False | Follow me on Instagram for extra content https... | Entertainment |
| 64132 | HFk73_EdK3o | #1 76ERS at #5 HAWKS | FULL GAME HIGHLIGHTS | ... | 2021-06-17 02:45:41+00:00 | UCWJ2lWNubArHWmf3FIHbfcQ | NBA | 17 | 2021-06-21 00:00:00+00:00 | Basketball|G League|NBA|game-0042000205 | 1605052 | 16700 | 677 | 5527 | https://i.ytimg.com/vi/HFk73_EdK3o/default.jpg | False | False | #1 76ERS at #5 HAWKS | FULL GAME HIGHLIGHTS | ... | Sports |
| 112783 | pIB3neebwSk | #1 Absolute Best Remedy for Dry and Wrinkled H... | 2022-02-17 11:15:05+00:00 | UC3w193M5tYPJqF0Hi-7U-2g | Dr. Eric Berg DC | 27 | 2022-02-22 00:00:00+00:00 | #1 Absolute Best Remedy for Dry and Wrinkled H... | 1318126 | 45372 | 0 | 2872 | https://i.ytimg.com/vi/pIB3neebwSk/default.jpg | False | False | Lotion may actually make your hands drier. Giv... | Education |
| 23569 | QY7ArP0ebaM | #1 Alabama Crimson Tide vs. LSU Tigers: Extend... | 2020-12-06 05:05:49+00:00 | UCja8sZ2T4ylIqjggA1Zuukg | CBS Sports HQ | 17 | 2020-12-10 00:00:00+00:00 | Alabama Crimson Tide|LSU Tigers|Alabama Crimso... | 335967 | 1644 | 102 | 493 | https://i.ytimg.com/vi/QY7ArP0ebaM/default.jpg | False | False | No. 1 Alabama leads LSU 52-17 after three quar... | Sports |
| 87371 | AtW81jzLx2o | #1 Alabama Vs Texas A&M Extended Highlights | ... | 2021-10-10 04:42:02+00:00 | UCja8sZ2T4ylIqjggA1Zuukg | CBS Sports HQ | 17 | 2021-10-16 00:00:00+00:00 | college football|Alabama|roll tide|Texas a&m|A... | 635561 | 5203 | 206 | 1228 | https://i.ytimg.com/vi/AtW81jzLx2o/default.jpg | False | False | Extended highlights from Texas A&M upset over ... | Sports |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 134781 | 7UfiCa244XE | 🥪 👜 Ma’amalade sandwich Your Majesty? | 2022-06-06 12:32:01+00:00 | UCTkC3Jt91QkqNAE4FGWkEIQ | The Royal Family | 22 | 2022-06-12 00:00:00+00:00 | [None] | 3695044 | 106396 | 0 | 0 | https://i.ytimg.com/vi/7UfiCa244XE/default.jpg | True | False | The Queen and Paddington Bear get the Platinum... | People & Blogs |
| 63555 | BNqg-1GBpLw | 🥭Fresh Farm Fruit Eating | Tiktok China | Oddl... | 2021-06-03 13:39:19+00:00 | UCyEHvAVMP9yY0Gp0ndLnwEQ | Fruit Satisfying | 22 | 2021-06-17 00:00:00+00:00 | Tik Tok China|farm fresh ninja fruit|Fresh fru... | 55054684 | 661233 | 47998 | 2087 | https://i.ytimg.com/vi/BNqg-1GBpLw/default.jpg | False | False | Fresh Farm Fruit Eating on Orange Garden | Odd... | People & Blogs |
| 121377 | VWo7X6RijrM | 🥳How To UNLOCK THE FREE EXCLUSIVE *EASY* in Pe... | 2022-04-01 17:18:08+00:00 | UC5_pDWCrgnvP4YpWsWFPv7g | Sonsss | 20 | 2022-04-06 00:00:00+00:00 | sonsss|new|roblox|sonsss pet sim x|pet simulat... | 875677 | 15328 | 0 | 2970 | https://i.ytimg.com/vi/VWo7X6RijrM/default.jpg | False | False | 🥳How To UNLOCK THE FREE EXCLUSIVE *EASY* in Pe... | Gaming |
| 117943 | ClSgM70C6r0 | 🦌 WOODLAND EGG UPDATE! 🌲 8 NEW PETS! 😲 Adopt M... | 2022-03-16 18:00:12+00:00 | UCVdPM7Dgxm3cHXM2ro__bUg | PlayAdoptMe | 20 | 2022-03-20 00:00:00+00:00 | adoptme|playadoptme|playadoptmeroblox|adoptmer... | 383351 | 23615 | 0 | 3298 | https://i.ytimg.com/vi/ClSgM70C6r0/default.jpg | False | False | 🥚 Woodland Egg Update! 🥚🦌 8 new woodsy creatur... | Gaming |
| 116389 | YSaE8OWwJZE | 🦝Raccoon Powers 🦝| Ep. 1 | Afterlife Minecraft... | 2022-03-04 17:30:03+00:00 | UCzTlXb7ivVzuFlugVCv3Kvg | LDShadowLady | 20 | 2022-03-12 00:00:00+00:00 | ldshadowlady|minecraft|mini game|girl gamer|pi... | 2132503 | 144402 | 0 | 8865 | https://i.ytimg.com/vi/YSaE8OWwJZE/default.jpg | False | False | Please *boop* the like button if you enjoy the... | Gaming |
37251 rows × 17 columns
This is after taking the row with the highest views for each unique video. Let's plot these with the same graphs.
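To make the `idxmax` step easier to follow, here is the same pattern on a tiny made-up frame. The index labels 100 to 104 are arbitrary and only there to show that `idxmax` returns labels rather than positions:

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["A", "A", "B", "B", "B"], "view_count": [10, 30, 5, 50, 20]},
    index=[100, 101, 102, 103, 104],
)

# For each title, idxmax gives the index label of the row with the most views
best_rows = toy.groupby("title")["view_count"].idxmax()
print(best_rows)

# .loc then selects those whole rows by label
toy_unique = toy.loc[best_rows]
print(toy_unique)
```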
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(unique["publishedAt"]), np.array(unique["view_count"]))
plt.ylabel("Viewcount")
plt.xlabel("Time")
plt.title("Viewcount over time")
plt.show()
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(unique["view_count"]), np.array(unique["likes"]))
plt.xlabel("Viewcount")
plt.ylabel("Likes")
plt.title("Viewcount and Likes")
plt.show()
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(unique["view_count"]), np.array(unique["dislikes"]))
plt.xlabel("Viewcount")
plt.ylabel("Dislikes")
plt.title("Viewcount and Dislikes")
plt.show()
As you can see, there are no more lines! This supports our earlier hypothesis.
Now let's try to color code these videos based on category. We'll use seaborn for this.
plt.figure(figsize=(20,13))
plot = sns.scatterplot(x='publishedAt', y='view_count', data=unique, hue='category_title')
plt.legend(title='Category', bbox_to_anchor=(1.0, 1.0), loc="upper left")
plt.ylabel("Viewcount")
plt.xlabel("Time")
plt.title("Viewcount over time")
plt.show()
plt.figure(figsize=(20,13))
plot = sns.scatterplot(x='view_count', y='likes', data=unique, hue='category_title')
plt.legend(title='Category', bbox_to_anchor=(1.0, 1.0), loc="upper left")
plt.xlabel("Viewcount")
plt.ylabel("Likes")
plt.title("Viewcount and Likes")
plt.show()
plt.figure(figsize=(20,13))
plot = sns.scatterplot(x='view_count', y='dislikes', data=unique, hue='category_title')
plt.legend(title='Category', bbox_to_anchor=(1.0, 1.0), loc="upper left")
plt.xlabel("Viewcount")
plt.ylabel("Dislikes")
plt.title("Viewcount and Dislikes")
plt.show()
First of all, one might ask: why are there only 15 categories listed in the legend?
print(pd.unique(unique["category_title"]))
['Entertainment' 'Sports' 'Education' 'Music' 'Film & Animation' 'Science & Technology' 'People & Blogs' 'Comedy' 'Gaming' 'Travel & Events' 'News & Politics' 'Autos & Vehicles' 'Howto & Style' 'Pets & Animals' 'Nonprofits & Activism']
Looks like although the category file defines 32 categories (with IDs running up to 44), only 15 categories made it to the trending tab.
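To list which categories never appear, we can take a set difference between the titles defined in the category file and the titles observed in the data. A minimal sketch with small hypothetical subsets of both:

```python
# Hypothetical subset of the ID-to-title dictionary built earlier
defined = {"1": "Film & Animation", "10": "Music", "30": "Movies", "44": "Trailers"}

# Hypothetical subset of the category titles seen in the trending data
observed = {"Music", "Film & Animation"}

# Titles that are defined but never observed
unused = sorted(set(defined.values()) - observed)
print(len(defined), len(observed), unused)
```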
It's a bit hard to decipher which color some points actually belong to, so let's count the occurrences of each category to make sure.
unique['category_title'].value_counts()
category_title
Gaming                   7467
Entertainment            7274
Music                    5839
Sports                   4623
People & Blogs           3255
Comedy                   1880
Film & Animation         1394
News & Politics          1369
Science & Technology     1140
Howto & Style             996
Education                 902
Autos & Vehicles          732
Travel & Events           205
Pets & Animals            157
Nonprofits & Activism      18
Name: count, dtype: int64
Music videos, which are brown, seem to dominate the plots. Entertainment videos, which are pink, also seem to be common. Gaming videos do not show up as much on the graphs, even though Gaming is the largest category by count. We suspect this is because gaming videos are usually niche, which makes them very unlikely to be outliers in terms of engagement. In the likes and dislikes graphs, they are probably concentrated toward the bottom left, near the origin.
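This suspicion could be checked with per-category summary statistics instead of reading colors off the plot. The sketch below uses invented numbers purely to show the shape of the comparison; `summary` is a hypothetical name:

```python
import pandas as pd

# Hypothetical toy data: many mid-sized gaming videos, one huge music video
toy = pd.DataFrame({
    "category_title": ["Gaming", "Gaming", "Gaming", "Music", "Music"],
    "view_count": [400_000, 900_000, 600_000, 2_000_000, 250_000_000],
})

# Count, typical value, and largest value of view_count per category
summary = toy.groupby("category_title")["view_count"].agg(["count", "median", "max"])
print(summary.sort_values("median", ascending=False))
```

A category with a high count but a low median and maximum would be consistent with the explanation above.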
Let's try to find the top 50 videos in the dataset by views. We can do this easily by sorting.
sorted_unique = unique.sort_values(by="view_count", ascending=False)
sorted_unique.head(50)
| | video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category_title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152788 | gQlMMD8auMs | BLACKPINK - ‘Pink Venom’ M/V | 2022-08-19 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-09-10 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 277791741 | 12993894 | 0 | 3534337 | https://i.ytimg.com/vi/gQlMMD8auMs/default.jpg | False | False | BLACKPINK - ‘Pink Venom’ M/VKick in the door W... | Music |
| 56374 | WMweEpGlu_U | BTS (방탄소년단) 'Butter' Official MV | 2021-05-21 03:46:13+00:00 | UC3IZKseVpdzPSBaWxBxundA | HYBE LABELS | 10 | 2021-05-30 00:00:00+00:00 | BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄 | 264407389 | 16021534 | 150989 | 6738537 | https://i.ytimg.com/vi/WMweEpGlu_U/default.jpg | False | False | BTS (방탄소년단) 'Butter' Official MV Credits: Dire... | Music |
| 3358 | gdZLi9oWNZg | BTS (방탄소년단) 'Dynamite' Official MV | 2020-08-21 03:58:10+00:00 | UC3IZKseVpdzPSBaWxBxundA | Big Hit Labels | 10 | 2020-08-28 00:00:00+00:00 | BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄 | 232649205 | 15735533 | 714194 | 6065230 | https://i.ytimg.com/vi/gdZLi9oWNZg/default.jpg | False | False | BTS (방탄소년단) 'Dynamite' Official MVCredits:Dire... | Music |
| 73564 | hdmx71UjBXs | Turn into orbeez - Tutorial #Shorts | 2021-07-03 04:04:57+00:00 | UCt8z2S30Wl-GQEluFVM8NUw | FFUNTV | 24 | 2021-08-08 00:00:00+00:00 | [None] | 206202284 | 6840430 | 240769 | 2826 | https://i.ytimg.com/vi/hdmx71UjBXs/default.jpg | False | False | Turn into orbeez - Tutorial #ShortsHey guys! W... | Entertainment |
| 4980 | vRXZj0DzXIA | BLACKPINK - 'Ice Cream (with Selena Gomez)' M/V | 2020-08-28 04:00:11+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2020-09-05 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 184778248 | 11795670 | 879354 | 2735997 | https://i.ytimg.com/vi/vRXZj0DzXIA/default.jpg | False | False | BLACKPINK - ‘Ice Cream (with Selena Gomez)’Com... | Music |
| 159786 | POe9SOEKotk | BLACKPINK - ‘Shut Down’ M/V | 2022-09-16 04:00:12+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2022-10-15 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 180654898 | 8438151 | 0 | 1326586 | https://i.ytimg.com/vi/POe9SOEKotk/default.jpg | False | False | BLACKPINK - ‘Shut Down’Blackpink in your areaB... | Music |
| 198978 | YudHcBIxlYw | JISOO - ‘꽃(FLOWER)’ M/V | 2023-03-31 04:00:14+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2023-04-29 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 170333822 | 8553612 | 0 | 1162689 | https://i.ytimg.com/vi/YudHcBIxlYw/default.jpg | False | False | JISOO - ‘꽃(FLOWER)’ ABC 도레미만큼 착했던 나그 눈빛이 싹 변했지... | Music |
| 179180 | CocEMWdc7Ck | SHAKIRA || BZRP Music Sessions #53 | 2023-01-12 00:00:07+00:00 | UCmS75G-98QihSusY7NfCZtw | Bizarrap | 24 | 2023-01-20 00:00:00+00:00 | bizarrap|biza|bisa|bizzarrap|bzrp|bzrp music s... | 158477831 | 8333879 | 0 | 468245 | https://i.ytimg.com/vi/CocEMWdc7Ck/default.jpg | False | False | SHAKIRA || BZRP Music Sessions #53Lyrics by: h... | Entertainment |
| 68979 | CuklIb9d3fI | BTS (방탄소년단) 'Permission to Dance' Official MV | 2021-07-09 03:59:12+00:00 | UC3IZKseVpdzPSBaWxBxundA | HYBE LABELS | 10 | 2021-07-16 00:00:00+00:00 | HYBE|HYBE LABELS|하이브|하이브레이블즈 | 156482499 | 12117314 | 102132 | 2781218 | https://i.ytimg.com/vi/CuklIb9d3fI/default.jpg | False | False | BTS (방탄소년단) 'Permission to Dance' Official MVC... | Music |
| 81347 | awkkyBH2zEo | LISA - 'LALISA' M/V | 2021-09-10 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2021-09-16 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 154134590 | 11348978 | 113448 | 2402692 | https://i.ytimg.com/vi/awkkyBH2zEo/default.jpg | False | False | LISA - LALISA내 뒷모습만 봐도 알잖아어두워질 때 분홍빛이나새하얀 조명이 ... | Music |
| 21367 | -5q5mZbe3V8 | BTS (방탄소년단) 'Life Goes On' Official MV | 2020-11-20 04:58:11+00:00 | UC3IZKseVpdzPSBaWxBxundA | Big Hit Labels | 10 | 2020-11-28 00:00:00+00:00 | BIGHIT|빅히트|방탄소년단|BTS|BANGTAN|방탄 | 150622781 | 11405030 | 126202 | 4160903 | https://i.ytimg.com/vi/-5q5mZbe3V8/default.jpg | False | False | BTS (방탄소년단) 'Life Goes On' Official MVCredits:... | Music |
| 118181 | ia6fRSeK8I0 | jai shree ram 🚩#shorts #ashortaday | 2022-03-15 03:21:02+00:00 | UCuIpkP1H2Vb_NFu8dLIWoPw | CHANDAN ART ACADEMY | 27 | 2022-03-21 00:00:00+00:00 | [None] | 149615603 | 7940311 | 0 | 48945 | https://i.ytimg.com/vi/ia6fRSeK8I0/default.jpg | False | False | NaN | Education |
| 11764 | dyRsYk0LyA8 | BLACKPINK – ‘Lovesick Girls’ M/V | 2020-10-02 04:00:13+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2020-10-09 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 140685439 | 9217876 | 127308 | 1507605 | https://i.ytimg.com/vi/dyRsYk0LyA8/default.jpg | False | False | BLACKPINK – ‘Lovesick Girls’영원한 밤창문 없는 방에 우릴 가... | Music |
| 91383 | U3ASj1L6_sY | Adele - Easy On Me (Official Video) | 2021-10-14 23:00:11+00:00 | UComP_epzeKzvBX156r6pm1Q | AdeleVEVO | 10 | 2021-11-05 00:00:00+00:00 | There ain’t no gold|In this river|That I’ve be... | 139547582 | 4714130 | 59433 | 241678 | https://i.ytimg.com/vi/U3ASj1L6_sY/default.jpg | False | False | Official Video for Easy On Me by Adele.Shop th... | Music |
| 199180 | 1WEAJ-DFkHE | $1 vs $500,000 Plane Ticket! | 2023-04-01 20:00:04+00:00 | UCX6OQ3DkcsbYNE6H8uQQuVA | MrBeast | 24 | 2023-04-30 00:00:00+00:00 | [None] | 139409168 | 4810157 | 0 | 144548 | https://i.ytimg.com/vi/1WEAJ-DFkHE/default.jpg | False | False | Check out ALL of MrBeast’s awesome jobs or dis... | Entertainment |
| 187747 | jZGpkLElSu8 | KAROL G, Shakira - TQG (Official Video) | 2023-02-24 05:00:09+00:00 | UCz9yS18zJGQObwUL_K-ICnw | KarolGVEVO | 10 | 2023-03-04 00:00:00+00:00 | karol|tqg|tqg karol|tqg shakira|karol shakira|... | 138548083 | 5461423 | 0 | 199368 | https://i.ytimg.com/vi/jZGpkLElSu8/default.jpg | False | False | KAROL G, Shakira - TQG (Official Video)Escucha... | Music |
| 96555 | 0e3GPea1Tyg | $456,000 Squid Game In Real Life! | 2021-11-24 21:00:01+00:00 | UCX6OQ3DkcsbYNE6H8uQQuVA | MrBeast | 24 | 2021-12-02 00:00:00+00:00 | [None] | 137068663 | 10926910 | 67027 | 527142 | https://i.ytimg.com/vi/0e3GPea1Tyg/default.jpg | False | False | MAKE SURE YOU WATCH UNTIL GLASS BRIDGE IT'S IN... | Entertainment |
| 54989 | TDNNkOv8M8Q | INSANE strawberry trick! 😨 #shorts | 2021-05-08 07:34:14+00:00 | UC6D1L2vxEAg_Vi0JSxMBDgA | Dan Rhodes | 24 | 2021-05-17 00:00:00+00:00 | [None] | 127913129 | 4244408 | 195737 | 29538 | https://i.ytimg.com/vi/TDNNkOv8M8Q/default.jpg | False | False | NaN | Entertainment |
| 67359 | PRz64kSEJqs | She is foxy but not enough #Shorts | 2021-06-29 00:47:43+00:00 | UCt8z2S30Wl-GQEluFVM8NUw | Fortnite Fun TV | 24 | 2021-07-08 00:00:00+00:00 | [None] | 126907932 | 4195786 | 139672 | 2660 | https://i.ytimg.com/vi/PRz64kSEJqs/default.jpg | False | False | She is foxy but not enough #ShortsHey guys! Wa... | Entertainment |
| 190583 | HjBo--1n8lI | Rihanna’s FULL Apple Music Super Bowl LVII Hal... | 2023-02-13 03:58:18+00:00 | UCDVYQ4Zhbm3S2dlz7P1GBDg | NFL | 17 | 2023-03-18 00:00:00+00:00 | NFL|Football|American Football|sport|sports | 113116792 | 2764210 | 0 | 150958 | https://i.ytimg.com/vi/HjBo--1n8lI/default.jpg | False | False | Listen to Rihanna’s iconic hits in Spatial Aud... | Sports |
| 127582 | 8dJyRm2jJ-U | PSY - 'That That (prod. & feat. SUGA of BTS)' MV | 2022-04-29 09:00:10+00:00 | UCrDkAvwZum-UTjHmzDI2iIw | officialpsy | 10 | 2022-05-07 00:00:00+00:00 | PSY|싸이|psy|Psy|박재상|Comeback|coming back|9th|PS... | 111477556 | 6496912 | 0 | 353082 | https://i.ytimg.com/vi/8dJyRm2jJ-U/default.jpg | False | False | PSY - 'That That (prod. & feat. SUGA of BTS)' ... | Music |
| 66766 | Fw7fbKoK3e8 | MvRyhan Funny videos #tiktok #Shorts | 2021-06-25 07:37:36+00:00 | UCcFQLco2CA2uq9J2Uwcoi6Q | Mv Ryhan | 24 | 2021-07-05 00:00:00+00:00 | [None] | 106089141 | 1866882 | 97235 | 7073 | https://i.ytimg.com/vi/Fw7fbKoK3e8/default.jpg | False | False | #shorts | Entertainment |
| 43374 | CKZvWhCqx1s | ROSÉ - 'On The Ground' M/V | 2021-03-12 05:00:15+00:00 | UCOmHUn--16B90oW2L6FRR3A | BLACKPINK | 10 | 2021-03-20 00:00:00+00:00 | YG Entertainment|YG|와이지|K-pop|BLACKPINK|블랙핑크|블... | 103691157 | 7244067 | 96144 | 1559327 | https://i.ytimg.com/vi/CKZvWhCqx1s/default.jpg | False | False | On The GroundMy life's been magic seems fantas... | Music |
| 89984 | MbWq-EwUy_M | Dice Stacks from $1 to $100 | 2021-10-06 21:44:27+00:00 | UChfTcl5XfdTUCjkfuro982Q | That's Amazing Shorts | 17 | 2021-10-29 00:00:00+00:00 | [None] | 103564168 | 7174425 | 120961 | 23727 | https://i.ytimg.com/vi/MbWq-EwUy_M/default.jpg | False | False | Dice Stacking Tricks from $1 to $100 #shorts | Sports |
| 985 | hsm4poTWjMs | Cardi B - WAP feat. Megan Thee Stallion [Offic... | 2020-08-07 04:00:10+00:00 | UCxMAbVFmxKUVGAll0WVGpFw | Cardi B | 10 | 2020-08-16 00:00:00+00:00 | Cardi B|Cardi|Atlantic Records|rap|hip hop|tra... | 98442414 | 3207729 | 467717 | 310630 | https://i.ytimg.com/vi/hsm4poTWjMs/default.jpg | False | False | Cardi B - WAP feat. Megan Thee StallionStream/... | Music |
| 126767 | myjEoDypUD8 | Watch the uncensored moment Will Smith smacks ... | 2022-03-28 03:06:53+00:00 | UCIRYBXDze5krPDzAEOxFGVA | Guardian News | 25 | 2022-05-03 00:00:00+00:00 | Jada Pinkett Smith|Jada Pinkett Smith chris ro... | 98202265 | 1443032 | 0 | 245566 | https://i.ytimg.com/vi/myjEoDypUD8/default.jpg | False | False | Best actor nominee Will Smith appeared to slap... | News & Politics |
| 115170 | LrJYKxyrMwg | Hey man, we are Italian 🇮🇹😅🤷🏼♀️#shorts #funny... | 2022-02-20 20:42:28+00:00 | UCQFYC_kwJ_FSs5UvZryzDFQ | Jessi & Sean | 22 | 2022-03-06 00:00:00+00:00 | [None] | 97611742 | 4005923 | 0 | 9645 | https://i.ytimg.com/vi/LrJYKxyrMwg/default.jpg | False | False | NaN | People & Blogs |
| 135776 | kXpOEzNZ8hQ | BTS (방탄소년단) 'Yet To Come (The Most Beautiful M... | 2022-06-10 03:59:38+00:00 | UC3IZKseVpdzPSBaWxBxundA | HYBE LABELS | 10 | 2022-06-17 00:00:00+00:00 | HYBE|HYBE LABELS|하이브|하이브레이블즈 | 93952431 | 9444379 | 0 | 2469783 | https://i.ytimg.com/vi/kXpOEzNZ8hQ/default.jpg | False | False | BTS (방탄소년단) 'Yet To Come (The Most Beautiful M... | Music |
| 61775 | ggJoXFNkRUw | Turn into egg - Tutorial #Shorts | 2021-05-23 19:29:35+00:00 | UCt8z2S30Wl-GQEluFVM8NUw | Fortnite Fun TV | 24 | 2021-06-13 00:00:00+00:00 | [None] | 92004215 | 2841950 | 103019 | 1865 | https://i.ytimg.com/vi/ggJoXFNkRUw/default.jpg | False | False | Turn into egg - Tutorial #ShortsHey guys! Watc... | Entertainment |
| 182581 | TJ2ifmkGGus | 1,000 Blind People See For The First Time | 2023-01-28 21:00:00+00:00 | UCX6OQ3DkcsbYNE6H8uQQuVA | MrBeast | 24 | 2023-02-06 00:00:00+00:00 | [None] | 91624124 | 7787760 | 0 | 305146 | https://i.ytimg.com/vi/TJ2ifmkGGus/default.jpg | False | False | If you would like to support more of this sigh... | Entertainment |
| 56080 | 2Zcz3Z0baVE | HOW TO GO THROUGH THE DRESS CODE 👗🎀💂♂️|| #SHORTS | 2021-05-15 18:29:43+00:00 | UC63mNFJR8EAb8wAIJwoCmTA | 5-Minute Crafts FAMILY | 26 | 2021-05-23 00:00:00+00:00 | 5-Minute Crafts|5-Minute Crafts Family|family|... | 89075984 | 2293772 | 97541 | 13435 | https://i.ytimg.com/vi/2Zcz3Z0baVE/default.jpg | False | False | #Shorts #YouTubeShorts... | Howto & Style |
| 91374 | O2W2gUXAt78 | My hidden talent #shorts | 2021-10-09 20:14:58+00:00 | UCiJUp_NBW2D_UMQIvq2nlPg | Zach King Shorts | 23 | 2021-11-05 00:00:00+00:00 | shorts|#shorts|zach king|magic|tiktok|dance|fe... | 87284105 | 2813557 | 64294 | 4402 | https://i.ytimg.com/vi/O2W2gUXAt78/default.jpg | False | False | Not many people know about my hidden talent of... | Comedy |
| 101782 | CvCtn5Q_nzs | Crazy #alluarjun #painting #shorts #viral #tr... | 2021-12-08 13:16:02+00:00 | UCMmvhaKpOxeyneANESIYXcA | Dr.Harrsha Artist | 1 | 2021-12-29 00:00:00+00:00 | [None] | 86415224 | 5676872 | 0 | 25975 | https://i.ytimg.com/vi/CvCtn5Q_nzs/default.jpg | False | False | NaN | Film & Animation |
| 55961 | FLGCGc7sAUw | Bella Poarch - Build a B*tch (Official Music V... | 2021-05-14 04:00:16+00:00 | UCCY_8y1FjtOegxKB4s2bWqw | Bella Poarch | 22 | 2021-05-22 00:00:00+00:00 | bella poarch|bella porch|build a bitch|build a... | 84063330 | 5627368 | 162546 | 308370 | https://i.ytimg.com/vi/FLGCGc7sAUw/default.jpg | False | False | Stream Build a B*tch: http://bellapoarch.lnk.t... | People & Blogs |
| 67739 | HwpUtagJ4PE | Oddly Satisfying Video #Shorts | 2021-06-25 09:02:18+00:00 | UCUzylqtbvZ8O4kL5FQb4bUA | Thanh Thảo Official | 22 | 2021-07-10 00:00:00+00:00 | [None] | 83600439 | 2469516 | 123738 | 4934 | https://i.ytimg.com/vi/HwpUtagJ4PE/default.jpg | False | False | Thanks for watching !Please like, share and su... | People & Blogs |
| 49577 | u4HYTp4sqH8 | I broke my finger! 😨 (Behind the scenes) 😂 #sh... | 2021-04-13 06:25:39+00:00 | UC6D1L2vxEAg_Vi0JSxMBDgA | Dan Rhodes | 24 | 2021-04-20 00:00:00+00:00 | [None] | 82000798 | 2525244 | 132813 | 13534 | https://i.ytimg.com/vi/u4HYTp4sqH8/default.jpg | False | False | This one is crazy! | Entertainment |
| 63187 | XA2YEHn-A8Q | TWICE Alcohol-Free M/V | 2021-06-09 08:58:12+00:00 | UCaO6TYtlC8U5ttz62hTrZgg | JYP Entertainment | 10 | 2021-06-16 00:00:00+00:00 | TWICE|트와이스|taste of love|alcoholfree|alcoholfr... | 81850570 | 2626749 | 60378 | 934021 | https://i.ytimg.com/vi/XA2YEHn-A8Q/default.jpg | False | False | TWICE Alcohol-Free M/V TWICE The 10th Mini Alb... | Music |
| 49144 | -C-16oZxTbw | This is impossible! (Behind the scenes) 🤐 #sh... | 2021-04-10 13:57:03+00:00 | UC6D1L2vxEAg_Vi0JSxMBDgA | Dan Rhodes | 24 | 2021-04-18 00:00:00+00:00 | [None] | 81514544 | 2616550 | 144498 | 15673 | https://i.ytimg.com/vi/-C-16oZxTbw/default.jpg | False | False | Crazy trick! | Entertainment |
| 46161 | 6swmTBVI83k | Lil Nas X - MONTERO (Call Me By Your Name) (Of... | 2021-03-26 04:00:14+00:00 | UCtTfSyci2urfwXXu_eRpNRA | LilNasXVEVO | 10 | 2021-04-03 00:00:00+00:00 | lilnasx|montero|lil nas x|lil nas|nas x|call m... | 79848249 | 3287141 | 290208 | 591103 | https://i.ytimg.com/vi/6swmTBVI83k/default.jpg | False | False | Official video for “MONTERO (Call Me By Your N... | Music |
| 175771 | YxWlaYCA8MU | Jhoome Jo Pathaan Song | Shah Rukh Khan, Deepi... | 2022-12-22 05:30:09+00:00 | UCbTLwN10NoCU4WDzLf1JMOA | YRF | 10 | 2023-01-03 00:00:00+00:00 | shah rukh khan|deepika padukone|pathaan song|p... | 79358096 | 2221125 | 0 | 163810 | https://i.ytimg.com/vi/YxWlaYCA8MU/default.jpg | False | False | Can't stop ourselves from vibing to this absol... | Music |
| 56173 | i0Ye1lBEgnM | Crazy STATIC TRICK! 😨 #shorts | 2021-05-15 11:06:18+00:00 | UC6D1L2vxEAg_Vi0JSxMBDgA | Dan Rhodes | 24 | 2021-05-23 00:00:00+00:00 | [None] | 78144546 | 2549623 | 91014 | 20752 | https://i.ytimg.com/vi/i0Ye1lBEgnM/default.jpg | False | False | NaN | Entertainment |
| 35580 | xxNxqveseyI | Amazon’s Big Game Commercial: Alexa’s Body | 2021-02-02 13:25:20+00:00 | UCkLXELm63_pH7L-r-548kig | amazon | 28 | 2021-02-09 00:00:00+00:00 | [None] | 77745621 | 51199 | 5779 | 7584 | https://i.ytimg.com/vi/xxNxqveseyI/default.jpg | False | False | It took us a while, but we've found a new body... | Science & Technology |
| 192376 | YLt73w6criQ | I Paid A Real Assassin To Try To Kill Me | 2023-03-18 20:00:01+00:00 | UCX6OQ3DkcsbYNE6H8uQQuVA | MrBeast | 24 | 2023-03-27 00:00:00+00:00 | [None] | 76930214 | 3535306 | 0 | 100403 | https://i.ytimg.com/vi/YLt73w6criQ/default.jpg | False | False | File with TurboTax today to get your biggest r... | Entertainment |
| 16161 | CM4CkVFmTds | TWICE I CAN'T STOP ME M/V | 2020-10-26 08:59:54+00:00 | UCaO6TYtlC8U5ttz62hTrZgg | JYP Entertainment | 10 | 2020-11-01 00:00:00+00:00 | TWICE|트와이스|eyeswideopen mv|I cant stop me mv|i... | 74991178 | 2874511 | 63407 | 929318 | https://i.ytimg.com/vi/CM4CkVFmTds/default.jpg | False | False | TWICE I CAN'T STOP ME M/VTWICE 2nd Album Eyes ... | Music |
| 198587 | 3inw26U-os4 | Grupo Frontera x Bad Bunny - un x100to (Video ... | 2023-04-17 16:00:14+00:00 | UCKsN6xyJ2w8g7p4p9apXkYQ | Grupo Frontera | 10 | 2023-04-27 00:00:00+00:00 | bad bunny|grupo frontera|frontera|frontera gru... | 74612458 | 1635163 | 0 | 43989 | https://i.ytimg.com/vi/3inw26U-os4/default.jpg | False | False | Suscríbete a nuestro canal: https://bit.ly/Gru... | Music |
| 13186 | YTZ-GZPTND8 | AMONG US, but with 99 IMPOSTORS | 2020-10-09 00:16:14+00:00 | UClarhNTgYk5wuztsunOx2Cw | The Pixel Kingdom | 20 | 2020-10-16 00:00:00+00:00 | among us|100 player|99 player|hack|cheat|multi... | 73728043 | 2337792 | 55198 | 69154 | https://i.ytimg.com/vi/YTZ-GZPTND8/default.jpg | False | False | Mankind was raised on The Skeld... it was neve... | Gaming |
| 54970 | PkKnp4SdE-w | NCT DREAM 엔시티 드림 '맛 (Hot Sauce)' MV | 2021-05-10 09:02:33+00:00 | UCEf_Bc-KVd7onSeifS3py9g | SMTOWN | 10 | 2021-05-17 00:00:00+00:00 | [None] | 73707275 | 2318954 | 33527 | 782811 | https://i.ytimg.com/vi/PkKnp4SdE-w/default.jpg | False | False | NCT DREAM's 1st album Hot Sauce is out!Listen ... | Music |
| 55981 | qpw5i2j6cHc | Money Plinko Challenge! 💰 #shorts | 2021-05-14 22:57:41+00:00 | UC9SUulKzcBvThtnUopgrYyg | AnthonySenpai | 20 | 2021-05-22 00:00:00+00:00 | [None] | 72681293 | 1935280 | 70167 | 4646 | https://i.ytimg.com/vi/qpw5i2j6cHc/default.jpg | False | False | NaN | Gaming |
| 56155 | 4TWR90KJl84 | aespa 에스파 'Next Level' MV | 2021-05-17 09:00:02+00:00 | UCEf_Bc-KVd7onSeifS3py9g | SMTOWN | 10 | 2021-05-23 00:00:00+00:00 | [None] | 72593468 | 1734529 | 62902 | 275245 | https://i.ytimg.com/vi/4TWR90KJl84/default.jpg | False | False | aespa's new single Next Level is out!Listen an... | Music |
| 179356 | G7KNmW9a75Y | Miley Cyrus - Flowers (Official Video) | 2023-01-13 00:00:09+00:00 | UCdI8evszfZvyAl2UVCypkTA | MileyCyrusVEVO | 10 | 2023-01-21 00:00:00+00:00 | Miley|Cyrus|MCEO|Miley New Song|Miley New Albu... | 72536620 | 3023633 | 0 | 82875 | https://i.ytimg.com/vi/G7KNmW9a75Y/default.jpg | False | False | Official Video for “Flowers” by Miley CyrusLis... | Music |
Code from https://www.analyticsvidhya.com/blog/2021/05/how-to-build-word-cloud-in-python/
from wordcloud import WordCloud
text2020 = " ".join(cat.split("|")[0] for cat in data.loc[data["publishedAt"].dt.year == 2020].tags)
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(text2020)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
text2021 = " ".join(cat.split("|")[0] for cat in data.loc[data["publishedAt"].dt.year == 2021].tags)
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(text2021)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
text2022 = " ".join(cat.split("|")[0] for cat in data.loc[data["publishedAt"].dt.year == 2022].tags)
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(text2022)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Here, we build a word cloud for each year with the WordCloud package. Each cloud uses the first tag of every trending video published in that year: the first plot covers 2020, the second 2021, and the third 2022.
from wordcloud import WordCloud
plt.figure(figsize=(20,10))
text2020 = (" ".join(cat.split("|")[0] for cat in data.loc[data["publishedAt"].dt.year == 2020].tags)).replace('[None]', '')
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(text2020)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
plt.figure(figsize=(20,10))
text2021 = (" ".join(cat.split("|")[0] for cat in data.loc[data["publishedAt"].dt.year == 2021].tags)).replace('[None]', '')
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(text2021)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
text2022 = (" ".join(cat.split("|")[0] for cat in data.loc[data["publishedAt"].dt.year == 2022].tags)).replace('[None]', '')
plt.figure(figsize=(20,10))
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(text2022)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The placeholder tag "[None]", which marks videos without tags, showed up as a very obvious "None" keyword and was the largest word in all three years. The code above therefore strips "[None]" from the text before generating the word clouds.
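The same cleanup can also be done before the tags are joined, which avoids touching any other tag that merely contains the text "[None]". This is a minimal sketch rather than the notebook's own code; the helper name `first_tags_text` and the sample tags are hypothetical.

```python
def first_tags_text(tag_strings, placeholder="[None]"):
    """Join the first '|'-separated tag of each video, skipping the placeholder."""
    first_tags = (tags.split("|")[0] for tags in tag_strings)
    return " ".join(tag for tag in first_tags if tag != placeholder)


# Hypothetical sample in the same format as the dataset's "tags" column
sample_tags = ["HYBE|HYBE LABELS", "[None]", "NFL|Football|sport"]
print(first_tags_text(sample_tags))  # HYBE NFL
```

With the notebook's dataframe, the call would look like `first_tags_text(data.loc[data["publishedAt"].dt.year == 2020].tags)`, and the result can be passed to `WordCloud(...).generate(...)` as before.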
channel_cumulative_sums=unique.groupby(["channelTitle"])["view_count"].sum().sort_values(ascending=False)
channel_cumulative_sums_as_frame = channel_cumulative_sums.to_frame()
channel_cumulative_sums_as_frame["channelTitle"] = channel_cumulative_sums_as_frame.index
channel_cumulative_sums_as_frame = channel_cumulative_sums_as_frame.drop(columns="channelTitle")
temp = channel_cumulative_sums_as_frame.reset_index().head(50)
unique_for_image =unique.groupby(["channelId"])["view_count"].sum().sort_values(ascending=False).to_frame()
temp["channelId"] = unique_for_image.reset_index().head(50)["channelId"]
temp.head(20)
|  | channelTitle | view_count | channelId |
|---|---|---|---|
| 0 | MrBeast | 2526660908 | UCX6OQ3DkcsbYNE6H8uQQuVA |
| 1 | BLACKPINK | 1912983541 | UC3IZKseVpdzPSBaWxBxundA |
| 2 | HYBE LABELS | 1532484746 | UCOmHUn--16B90oW2L6FRR3A |
| 3 | SMTOWN | 1369848949 | UCEf_Bc-KVd7onSeifS3py9g |
| 4 | JYP Entertainment | 1335988975 | UCaO6TYtlC8U5ttz62hTrZgg |
| 5 | NFL | 945109824 | UCDVYQ4Zhbm3S2dlz7P1GBDg |
| 6 | MrBeast Gaming | 892078757 | UCIPPMRA040LQr5QPyJEbmXA |
| 7 | NBA | 733364643 | UCWJ2lWNubArHWmf3FIHbfcQ |
| 8 | BANGTANTV | 727648334 | UCLkAepWjdylmXSltofFvsYQ |
| 9 | Marvel Entertainment | 701862407 | UCvC4D8onUfXzvjTOM-dBfEA |
| 10 | Dude Perfect | 636788063 | UCRijo3ddMTht_IHyNSNXpNQ |
| 11 | Big Hit Labels | 599584464 | UCUaT_39o1x6qWjz7K2pWcgw |
| 12 | Beast Reacts | 562235109 | UCmBA_wu8xGg1OfOkfW13Q0Q |
| 13 | Bad Bunny | 540900050 | UCmS75G-98QihSusY7NfCZtw |
| 14 | Bizarrap | 489131667 | UCjmJDM5pRKbUlVIzDYYWb6g |
| 15 | Warner Bros. Pictures | 483263857 | UCke6I9N4KfC968-yRcd5YRg |
| 16 | SSundee | 466972707 | UCz97F7dMxBNOfGYu3rx8aCw |
| 17 | Sony Pictures Entertainment | 466378833 | UCpB959t8iPrxQWj7G6n0ctQ |
| 18 | SSSniperWolf | 462433311 | UCt8z2S30Wl-GQEluFVM8NUw |
| 19 | Apple | 448863763 | UCE_M8A5yxnLfW0KghEeajjw |
Here, we calculate the total views accumulated by each channel that had a trending video, which gives us a rough popularity metric. The "channel_cumulative_sums" variable uses the "groupby" function to group the unique data by channel title and sum the view counts. The result is sorted in descending order so that the channels with the highest total views come first.
To make it more presentable, the "channel_cumulative_sums" series is converted to a pandas dataframe using the "to_frame()" method, and "reset_index()" turns the "channelTitle" index back into a regular column. The top 50 channels with the highest total views are stored in the "temp" variable.
Additionally, to prepare for the visualization, the "unique_for_image" variable calculates the total views for each unique channel id and sorts it in descending order. The top 50 channel ids are then added as a new column to the "temp" variable. Note that this column is attached by rank position rather than by channel, so an id lines up with its title only when the two rankings agree. They can diverge when one channel id appears under several titles: in the top-50 table above, HYBE LABELS and Big Hit Labels share the id UC3IZKseVpdzPSBaWxBxundA, and in the output here that id ends up next to BLACKPINK.
This data can be useful for analyzing the popularity of channels with videos that went trending and identifying which channels have the most views overall.
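One way to keep each id attached to its own totals would be to group by channel id once and carry a title along. The following is a minimal sketch on made-up data: the frame `unique_demo` and its values are hypothetical, and showing the last title seen for each id is an assumption about what should be displayed.

```python
import pandas as pd

# Hypothetical miniature of the `unique` dataframe; view counts are made up
unique_demo = pd.DataFrame({
    "channelId": ["id_a", "id_a", "id_b", "id_c"],
    "channelTitle": ["Old Name", "New Name", "Channel B", "Channel C"],
    "view_count": [100, 50, 120, 30],
})

# Group once by channel id so each total stays attached to its own id,
# and keep the last title seen for that id as the display name.
per_channel = (
    unique_demo.groupby("channelId")
    .agg(channelTitle=("channelTitle", "last"), view_count=("view_count", "sum"))
    .sort_values("view_count", ascending=False)
    .reset_index()
)
print(per_channel.head(50))
```

Because the id, the title, and the total come from the same row, a renamed channel is counted once and its picture cannot end up on another channel's circle.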
a = unique.loc[unique["publishedAt"].dt.year == 2022]
a = a["channelTitle"].value_counts().sort_index()
occ_count = pd.DataFrame({"channelTitle": a.index, "occurences": a.values})
occ_count = occ_count.sort_values(by="occurences", ascending=False)
#trim to top 100
occ_count = occ_count.head(100)
#add channelId - by looking up in unique
occ_count["channelId"] = occ_count.apply(lambda row: unique.loc[unique["channelTitle"] == row["channelTitle"]]["channelId"].iloc[0], axis=1)
occ_count.head(20)
|  | channelTitle | occurences | channelId |
|---|---|---|---|
| 2443 | NFL | 124 | UCDVYQ4Zhbm3S2dlz7P1GBDg |
| 2419 | NBA | 114 | UCWJ2lWNubArHWmf3FIHbfcQ |
| 1155 | FOX Soccer | 57 | UCooTLkxcpnTNx6vfOovfBFA |
| 2428 | NBC Sports | 50 | UCqZQlzSHbVJrwrn5XvzrzcA |
| 2952 | Ryan Trahan | 46 | UCnmGIkw-KdI0W5siakKPKog |
| 551 | CBS Sports Golazo | 33 | UCET00YnetHT7tOpu12v8jxg |
| 346 | Beast Reacts | 32 | UCUaT_39o1x6qWjz7K2pWcgw |
| 1339 | Genshin Impact | 32 | UCiS882YPwZt1NfaM0gR0D9Q |
| 4244 | videogamedunkey | 31 | UCsvn_Po0SmunchJYOWpOxMg |
| 3970 | ZHC Crafts | 31 | UCPAk4rqVIwg1NCXh61VJxbg |
| 3455 | The Game Theorists | 30 | UCo_IB5145EVNcf8hw1Kku7w |
| 2422 | NBA on TNT | 30 | UCU7iRrk3xfpUk0R6VdyC1Ow |
| 303 | BWF TV | 29 | UChh-akEbUM8_6ghGVnJd6cQ |
| 710 | Clash of Clans | 29 | UCD1Em4q90ZUK2R5HKesszJg |
| 1219 | First We Feast | 29 | UCPD_bxCRGpmmeQcbe2kpPaA |
| 3060 | SeaWattgaming | 29 | UCSpfz1IyUA1NBH-cgj8ygUw |
| 2222 | Marvel Entertainment | 28 | UCvC4D8onUfXzvjTOM-dBfEA |
| 1005 | Dude Perfect | 28 | UCRijo3ddMTht_IHyNSNXpNQ |
| 4164 | morgans vlogs | 28 | UC-vaBe-YMpvcZL5rQ5OopZw |
| 3955 | YoungBoy Never Broke Again | 26 | UClW4jraMKz6Qj69lJf-tODA |
Here, we count, for each channel, the number of unique trending videos that were published in 2022.
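The row-by-row lookup of the channel id can also be expressed as a single mapping, which may be easier to read on a large frame. This is only a sketch on hypothetical data (`unique_demo` and its ids are made up); the column name "occurences" is kept as spelled in the notebook so the result stays compatible with the later cells.

```python
import pandas as pd

# Hypothetical stand-in for the 2022 slice of the `unique` dataframe
unique_demo = pd.DataFrame({
    "channelTitle": ["NFL", "NFL", "NBA"],
    "channelId": ["id_nfl", "id_nfl", "id_nba"],
})

# Count videos per channel title, most frequent first
counts = unique_demo["channelTitle"].value_counts()
occ_demo = counts.rename_axis("channelTitle").reset_index(name="occurences")

# Build a title -> id lookup from the first row seen for each title
id_by_title = (
    unique_demo.drop_duplicates("channelTitle")
    .set_index("channelTitle")["channelId"]
)
occ_demo["channelId"] = occ_demo["channelTitle"].map(id_by_title)
print(occ_demo.head(100))
```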
import requests
import os
# Create the output folder on the first run
if not os.path.exists("images"):
    print("Making images directory")
    os.mkdir("images")
unique_for_image = unique.groupby(["channelId"])["view_count"].sum().sort_values(ascending=False).to_frame()
unique_for_image["channelId"] = unique_for_image.index
# Comma-separated string of the top 50 channel ids
first50 = unique_for_image["channelId"].head(50).to_csv(header=None, index=None).strip('\n').replace("\r", "").replace("\n", ",")
# The API key is read from the environment instead of being hard-coded (YOUTUBE_API_KEY is an assumed variable name)
z = requests.get('https://www.googleapis.com/youtube/v3/channels?part=snippet&id='+first50+'&fields=items(id%2Csnippet%2Fthumbnails)&key='+os.environ["YOUTUBE_API_KEY"])
data = z.json()
# Download each channel's default-size profile picture
for item in data["items"]:
    img_data = requests.get(item["snippet"]["thumbnails"]["default"]["url"]).content
    with open('images/'+item["id"]+'.jpg', 'wb') as handler:
        handler.write(img_data)
We decided that the best way to graph the data would be a basic circular packing chart. In order to do this, we need to extract a few more things from the internet. Mainly, we need the profile picture thumbnails for each of the channels that we found above.
In this block of code, we are using the YouTube API to retrieve the thumbnail images of the top 50 channels with the highest total views from the "unique_for_image" dataframe.
First, we group the data by channel id, calculate the sum of their respective view counts, and sort it in descending order. Then we add the channel id as a new column in the dataframe.
Next, we use the "head()" method to select the first 50 channel ids, and convert them to a comma-separated string using the "to_csv()" method. This string is then used to make a GET request to the YouTube API with the appropriate parameters and API key.
The JSON response from the API is parsed and stored in the "data" variable. We loop through the "items" list in the response, and retrieve the thumbnail image URL for each channel using the appropriate keys.
We then make a GET request to that URL, read the raw image bytes from the response's "content" attribute, and write them to a file in the "images" folder, opened with "open()" and named after the channel id.
This code allows us to retrieve and save the thumbnail images of the top 50 channels with the highest total views.
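The request construction described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the helper names `batches` and `channels_url` are hypothetical, `YOUTUBE_API_KEY` is an assumed environment variable name, and the batch size of 50 follows the per-request id limit mentioned in this notebook.

```python
import os
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/channels"


def batches(ids, size=50):
    """Split a list of channel ids into groups of at most `size` ids."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def channels_url(channel_ids, api_key):
    """Build the request URL for one batch of channel ids."""
    params = {
        "part": "snippet",
        "id": ",".join(channel_ids),
        "fields": "items(id,snippet/thumbnails)",
        "key": api_key,
    }
    # urlencode percent-encodes the commas, slashes, and parentheses
    return API_URL + "?" + urlencode(params)


# Assumed environment variable; the fallback is only a placeholder string
api_key = os.environ.get("YOUTUBE_API_KEY", "YOUR_API_KEY")
example_ids = ["UCX6OQ3DkcsbYNE6H8uQQuVA", "UCDVYQ4Zhbm3S2dlz7P1GBDg"]
urls = [channels_url(batch, api_key) for batch in batches(example_ids)]
print(len(urls))  # one URL per batch of up to 50 ids
```

Each URL can then be passed to `requests.get(...)` exactly as in the cell above, and keeping the key out of the notebook means it is not published along with the analysis.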
# The API key is read from the environment instead of being hard-coded (YOUTUBE_API_KEY is an assumed variable name)
first50 = occ_count["channelId"].iloc[:50].to_csv(header=None, index=None).strip('\n').replace("\r", "").replace("\n", ",")
z = requests.get('https://www.googleapis.com/youtube/v3/channels?part=snippet&id='+first50+'&fields=items(id%2Csnippet%2Fthumbnails)&key='+os.environ["YOUTUBE_API_KEY"])
data = z.json()
for item in data["items"]:
    img_data = requests.get(item["snippet"]["thumbnails"]["default"]["url"]).content
    with open('images/'+item["id"]+'.jpg', 'wb') as handler:
        handler.write(img_data)
#split request into 2 groups - google api does not allow more than 50 ids per request
second50 = occ_count["channelId"].iloc[50:100].to_csv(header=None, index=None).strip('\n').replace("\r", "").replace("\n", ",")
z = requests.get('https://www.googleapis.com/youtube/v3/channels?part=snippet&id='+second50+'&fields=items(id%2Csnippet%2Fthumbnails)&key='+os.environ["YOUTUBE_API_KEY"])
data = z.json()
for item in data["items"]:
    img_data = requests.get(item["snippet"]["thumbnails"]["default"]["url"]).content
    with open('images/'+item["id"]+'.jpg', 'wb') as handler:
        handler.write(img_data)
Here we run the same code on the top 100 channels by number of unique occurrences on trending in 2022.
# Code from https://stackoverflow.com/questions/51486297/cropping-an-image-in-a-circular-way-using-python
from PIL import Image, ImageDraw
ids = temp['channelId'].iloc[::-1]
for channel_id in ids:
    # Open the input image, convert to RGB, and load it as a numpy array
    img = Image.open('images/'+channel_id+'.jpg').convert("RGB")
    npImage = np.array(img)
    w, h = img.size  # PIL reports size as (width, height)
    # Create same size alpha layer with circle
    alpha = Image.new('L', img.size, 0)
    draw = ImageDraw.Draw(alpha)
    draw.pieslice([0, 0, w, h], 0, 360, fill=255)
    # Convert alpha Image to numpy array
    npAlpha = np.array(alpha)
    # Add alpha layer to RGB
    npImage = np.dstack((npImage, npAlpha))
    # Save with alpha
    Image.fromarray(npImage).save('images/'+channel_id+'.png')
ids = occ_count['channelId'].iloc[:100]
for channel_id in ids:
    # Open the input image, convert to RGB, and load it as a numpy array
    img = Image.open('images/'+channel_id+'.jpg').convert("RGB")
    npImage = np.array(img)
    w, h = img.size  # PIL reports size as (width, height)
    # Create same size alpha layer with circle
    alpha = Image.new('L', img.size, 0)
    draw = ImageDraw.Draw(alpha)
    draw.pieslice([0, 0, w, h], 0, 360, fill=255)
    # Convert alpha Image to numpy array
    npAlpha = np.array(alpha)
    # Add alpha layer to RGB
    npImage = np.dstack((npImage, npAlpha))
    # Save with alpha
    Image.fromarray(npImage).save('images/'+channel_id+'.png')
This code crops the downloaded jpg images into circles and saves them as png files for graphing. It covers the top 50 channels by view count and the top 100 channels by occurrences in trending in 2022.
# Code taken from https://www.python-graph-gallery.com/circular-packing-1-level-hierarchy, slightly modified
import circlify
circles = circlify.circlify(
temp["view_count"].tolist(),
show_enclosure=False,
target_enclosure=circlify.Circle(x=0, y=0, r=1)
)
# import libraries
import circlify
import matplotlib.pyplot as plt
# Create just a figure and only one subplot
fig, ax = plt.subplots(figsize=(20,20))
# Title
ax.set_title('Top 50 Channels by Cumulative Viewcount of Trending Videos in 2022')
# Remove axes
ax.axis('off')
# Find axis boundaries
lim = max(
max(
abs(circle.x) + circle.r,
abs(circle.y) + circle.r,
)
for circle in circles
)
plt.xlim(-lim, lim)
plt.ylim(-lim, lim)
# list of labels
labels = temp['channelTitle'].iloc[::-1]
ids = temp['channelId'].iloc[::-1]
# print circles
for circle, label, channel_id in zip(circles, labels, ids):
    x, y, r = circle
    ax.add_patch(plt.Circle((x, y), r, alpha=1, linewidth=2, edgecolor='black', facecolor='none'))
    #plt.annotate(label, (x,y ) , va='center', ha='center', fontsize=7, fontweight='bold')
    # load the image
    img = plt.imread('images/'+channel_id+'.png')
    # add the image as an annotation
    ax.imshow(img, extent=[x-r, x+r, y-r, y+r], alpha=1)
circles = circlify.circlify(
occ_count["occurences"].tolist(),
show_enclosure=False,
target_enclosure=circlify.Circle(x=0, y=0, r=1)
)
# import libraries
import circlify
import matplotlib.pyplot as plt
# Create just a figure and only one subplot
fig, ax = plt.subplots(figsize=(20,20))
# Title
ax.set_title('Top 100 Channels by Occurences of Videos in Trending in 2022')
# Remove axes
ax.axis('off')
# Find axis boundaries
lim = max(
max(
abs(circle.x) + circle.r,
abs(circle.y) + circle.r,
)
for circle in circles
)
plt.xlim(-lim, lim)
plt.ylim(-lim, lim)
# list of labels
labels = occ_count['channelTitle'].iloc[::-1]
ids = occ_count['channelId'].iloc[::-1]
# print circles
for circle, label, channel_id in zip(circles, labels, ids):
    x, y, r = circle
    ax.add_patch(plt.Circle((x, y), r, alpha=1, linewidth=2, edgecolor='black', facecolor='none'))
    #plt.annotate(label, (x,y ) , va='center', ha='center', fontsize=7, fontweight='bold')
    # load the image
    img = plt.imread('images/'+channel_id+'.png')
    # add the image as an annotation
    ax.imshow(img, extent=[x-r, x+r, y-r, y+r], alpha=1)
The second plot reuses the same code for the top 100 channels by occurrences in trending in 2022.
Let's try some classification models on this dataset. The best categorical variable we have is the video category, so let's try to predict it from view count and likes. We are going to use the unique dataframe for this. Note that we tuned the parameters by hand to get the maximum accuracy.
You can learn more about these models by following their scikit-learn pages:
Decision Tree
Random Forest
K Nearest Neighbors
First, let's try a decision tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import metrics
tree = DecisionTreeClassifier(max_depth = 10)
X = unique[["view_count", "likes"]]
y= unique["categoryId"].astype("int")
# from https://www.datacamp.com/tutorial/decision-tree-classification-python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# from https://www.datacamp.com/tutorial/decision-tree-classification-python
tree = tree.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = tree.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.29098067287043666
It looks like we got a pretty bad accuracy, but at least it's better than guessing uniformly at random among 15 categories, which would give us about 6.67% accuracy. Let's try again with a random forest classifier, which is usually less prone to overfitting than a single tree.
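The random-guess figure can be checked directly, and a second, often stricter floor is the accuracy of always predicting the most common category. The sketch below assumes 15 categories as stated above; the helper names and the example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def uniform_guess_accuracy(n_classes):
    """Expected accuracy when guessing uniformly among n_classes labels."""
    return 1 / n_classes


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


print(round(uniform_guess_accuracy(15), 4))  # 0.0667

# Hypothetical category ids, only to show the call
example_labels = [10, 10, 10, 24, 24, 17]
print(majority_class_accuracy(example_labels))  # 0.5
```

In this notebook, `majority_class_accuracy(y_test.tolist())` would give the share of the largest category in the test set, which is a useful number to compare the classifier accuracies against.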
# from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth = 15, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.29142806012884753
This is marginally better, but not by much. It looks like this is about the limit for tree-based classifiers on these two features.
Let's try another type of classifier, k nearest neighbors.
# from https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=100)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.29205440229062274
K nearest neighbors doesn't do much better. Let's look at the decision boundaries that each model learned from the training dataset.
from sklearn.inspection import DecisionBoundaryDisplay
def graph_boundary(model):
    fig = plt.figure(figsize=(20,10))
    ax = fig.add_subplot(111)
    DecisionBoundaryDisplay.from_estimator(
        model, X_train, alpha=0.8, eps=0.5, ax=ax
    )
    sns.scatterplot(
        x=X["view_count"],
        y=X["likes"],
        hue=y,
        alpha=1.0,
        edgecolor="black",
    )
    plt.xlabel("Viewcount")
    plt.ylabel("Likes")
graph_boundary(tree)
plt.title("Decision Tree")
plt.show()
graph_boundary(clf)
plt.title("Random Forest")
plt.show()
graph_boundary(neigh)
plt.title("K Nearest Neighbors")
plt.show()
From this, we can see that the decision boundaries are not very good. Because the data doesn't separate into clusters by category, it is very hard to draw good decision boundaries in the first place.
Say you wanted to become a trending youtuber. How would you do so? This section of our analysis examines channels that often end up on trending and what characteristics they have in common. The first explanation that occurred to our group for how a video ends up on trending is how quickly it accumulates views. To examine this, we'll find the amount of time between when a video was published and when it became trending, along with the number of views it had when it became trending.
We expect a positive relationship. If a video ends up on trending a long time after it was released, it has had more time to accumulate views and should therefore have a higher view count.
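The time gap described above is a plain timestamp subtraction, and keeping only the days component drops the leftover hours. Here is a small sketch with the standard library, using the publish and trending timestamps from the 'Permission to Dance' row of the top-50 table; pandas' `.dt.days` keeps the same whole-day component.

```python
from datetime import datetime, timezone

# Timestamps from the 'Permission to Dance' row of the top-50 table
published = datetime(2021, 7, 9, 3, 59, 12, tzinfo=timezone.utc)
trending = datetime(2021, 7, 16, 0, 0, 0, tzinfo=timezone.utc)

gap = trending - published
print(gap)       # 6 days, 20:00:48
print(gap.days)  # 6
```

Since trending dates in this dataset are recorded at midnight, the whole-day gap is usually one less than the difference between the two calendar dates.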
unique["TimeDiff"] = (unique["trending_date"] - unique["publishedAt"]).dt.days
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(unique["TimeDiff"]), np.array(unique["view_count"]))
plt.ylabel("View Count")
plt.xlabel("Time Between Trending and Release (In Days)")
plt.title("Views Versus Time Difference")
plt.show()
Hmm. This doesn't seem very telling. We don't see a clear positive relationship. Many videos make the cut without accumulating a lot of views in a short amount of time, while a few videos reach trending with a huge number of views in a short amount of time. Moreover, the view count doesn't necessarily increase as the time gap increases. We'll come back to this.
Let's look at a plot of how many times certain channels appear in the trending list and compare it to the number of views it gets. This will show whether channels that frequently appear on trending produce videos that have many views.
a = unique["channelTitle"].value_counts().sort_index()
occ_count = pd.DataFrame({"channelTitle": a.index,
"occurences": a.values})
result = unique.groupby('channelTitle')['view_count'].median()
occ_count_med = pd.merge(result, occ_count, on = "channelTitle")
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(occ_count_med["view_count"]), np.array(occ_count_med["occurences"]))
plt.ylabel("Occ")
plt.xlabel("ViewCount")
for i, row in occ_count_med.iterrows():
    if row['view_count'] > 100000000 or row['occurences'] > 100 or row["channelTitle"] == "MrBeast":
        plt.annotate(row['channelTitle'], (row['view_count'], row['occurences']))
plt.title("Median Video Viewcount vs Occurrences in Trending By Channel")
plt.show()
This doesn't seem very telling either. The accounts that have the most occurrences in trending actually seem to have a very low median view count. Namely, NBA and NFL are off the charts in the number of times they make it to trending but have a shockingly low median view count, whereas the account with an extremely large number of views has very few occurrences in trending. It seems the only channel that makes it onto trending many times and also has consistent viewership is MrBeast.
A simple explanation for this could be that some channels make a single video that makes it to trending through a massive number of views and then never appear on trending again. To eliminate the effects of these "one hit wonders", instead of looking at the median, let's take a look at the most viewed video for each channel versus the number of times that channel appears on trending. This way, we can see if channels that consistently make it onto trending have videos that got a lot of views. A possible explanation for why they stay trending could be that they are the most popular in their own category by a large margin (e.g. maybe NBA consistently gets more views than any other sports content), but this is difficult to check because we don't have data on non-trending sports videos.
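The per-channel table behind this plot (occurrence count plus median views) can also be built in one `groupby` pass instead of `value_counts` followed by a merge. This is a sketch on made-up data: the `toy` frame and the output column names `occurrences`, `median_views`, and `max_views` are assumptions for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Invented rows standing in for `unique`; one row per trending video.
toy = pd.DataFrame({
    "channelTitle": ["A", "A", "A", "B", "C", "C"],
    "view_count": [100, 300, 200, 9000, 50, 70],
})

# Named aggregation: each keyword becomes an output column.
per_channel = (
    toy.groupby("channelTitle")
       .agg(
           occurrences=("view_count", "count"),
           median_views=("view_count", "median"),
           max_views=("view_count", "max"),
       )
       .reset_index()
)
print(per_channel)
```

Keeping the median and the maximum side by side makes "one hit" channels easy to spot: they have a single occurrence and identical median and maximum view counts.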
result = unique.groupby('channelTitle')['view_count'].max()
occ_count_max = pd.merge(result, occ_count, on = "channelTitle")
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(occ_count_max["view_count"]), np.array(occ_count_max["occurences"]))
plt.ylabel("Occ")
plt.xlabel("ViewCount")
plt.title("Maximum Video Viewcount vs Occurrences in Trending By Channel")
for i, row in occ_count_max.iterrows():
    if row['view_count'] > 130000000 or row['occurences'] > 80:
        plt.annotate(row['channelTitle'], (row['view_count'], row['occurences']))
This is much better! We see that BlackPink has a somewhat high occurrence count on trending but also a video with an enormous view count. We see MrBeast stays winning, and NFL actually has a video with a large view count! Chandan Art Academy falls significantly for a simple reason: they only showed up on trending once, with a video that got an extremely large number of views. It is overtaken once we look at the maximum view counts of several big channels that have very successful videos but churn out content with a lower median view count. However, a lot of channels that occur frequently don't necessarily have these "standout" videos, such as NBC, SSundee, ESPN, and Saturday Night Live. Again, a possible justification could be that they consistently outperform their own category, but we don't have that data. We could compare them to other trending videos in their category. Let's take a look at that later.
Another possible explanation could be that these videos have a lot of likes and interactions.
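The labelling loops in this section all repeat the same pattern: label a channel if it exceeds a threshold on either axis, or if it is a channel we specifically care about. Below is a sketch of how that selection could be factored out with a boolean mask. The helper name `rows_to_label` and the toy values are hypothetical; the column names follow the ones used above, including the `occurences` spelling.

```python
import pandas as pd

def rows_to_label(df, x_col, y_col, x_min, y_min, always=()):
    """Return rows above either threshold, plus any channels listed in `always`."""
    mask = (
        (df[x_col] > x_min)
        | (df[y_col] > y_min)
        | df["channelTitle"].isin(list(always))
    )
    return df[mask]

# Invented per-channel table for illustration.
toy = pd.DataFrame({
    "channelTitle": ["A", "B", "C", "D"],
    "view_count": [150_000_000, 2_000_000, 5_000_000, 1_000_000],
    "occurences": [1, 120, 30, 10],
})

labelled = rows_to_label(toy, "view_count", "occurences", 130_000_000, 80, always=["C"])
print(labelled["channelTitle"].tolist())
```

Each selected row can then be passed to `plt.annotate` exactly as in the loops above, so only the thresholds change from plot to plot.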
result = unique.groupby('channelTitle')['likes'].median()
like_count_med = pd.merge(result, occ_count, on = "channelTitle")
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(like_count_med["likes"]), np.array(like_count_med["occurences"]))
plt.ylabel("Occ")
plt.xlabel("Likes")
plt.title("Median Video Likes vs Occurrences in Trending By Channel")
for i, row in like_count_med.iterrows():
    if row['likes'] > 3000000 or row['occurences'] > 80:
        plt.annotate(row['channelTitle'], (row['likes'], row['occurences']))
Again, not much to look at. We still have outliers in Chandan Art Academy, NBA, and NFL. This is probably because likes are strongly correlated with view count, so the plot should have a similar shape. Let's look at comment count.
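The claim that likes track view count can be checked directly with a correlation matrix over the engagement columns. This is a sketch on invented numbers, so the printed values say nothing about the real dataset; on the real data the same call would be made on `unique`.

```python
import pandas as pd

# Invented values; likes are an exact multiple of views here, comments are not.
toy = pd.DataFrame({
    "view_count":    [1_000_000, 2_000_000, 3_000_000, 4_000_000],
    "likes":         [50_000, 100_000, 150_000, 200_000],
    "comment_count": [4_000, 1_000, 3_000, 2_000],
})

# Pairwise Pearson correlations between the three engagement metrics.
corr = toy[["view_count", "likes", "comment_count"]].corr()
print(corr.round(3))
```

If two metrics are almost perfectly correlated, plotting either one against occurrences will give nearly the same picture, which is what we observed for likes and views.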
result = unique.groupby('channelTitle')['comment_count'].median()
comm_count_med = pd.merge(result, occ_count, on = "channelTitle")
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(comm_count_med["comment_count"]), np.array(comm_count_med["occurences"]))
plt.ylabel("Occ")
plt.xlabel("Comments")
plt.title("Median Video Comments vs Occurrences in Trending By Channel")
for i, row in comm_count_med.iterrows():
    if row['comment_count'] > 1000000 or row['occurences'] > 80:
        plt.annotate(row['channelTitle'], (row['comment_count'], row['occurences']))
Strangely, Chandan Art Academy falls off completely but is replaced by 5911 Records. Otherwise, there is not much difference. We do see a polarization, though: most channels are grouped below a million comments, with only a few breaking a million, whereas in the other graphs the points were more spread out, with more channels reaching high view or like counts.
Let's return to the time difference graph. Perhaps NBA and NFL get on trending very quickly because they accumulate a lot of views in a very short amount of time.
result = unique.groupby('channelTitle')['TimeDiff'].median()
result = pd.DataFrame({"channelTitle": result.index,
"TimeDiff": result.values})
view = unique.groupby('channelTitle')['view_count'].mean()
occ_count_max = pd.merge(result, view, on = "channelTitle")
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(occ_count_max["TimeDiff"]), np.array(occ_count_max["view_count"]))
plt.ylabel("Mean ViewCount")
plt.xlabel("Median TimeDiff (Days)")
for i, row in occ_count_max.iterrows():
    if row['view_count'] > 80000000 or row['TimeDiff'] > 31 or row["channelTitle"] == "NBA" or row["channelTitle"] == "NFL" or row["channelTitle"] == "NBC Sports" or row["channelTitle"] == "SSundee":
        plt.annotate(row['channelTitle'], (row['TimeDiff'], row['view_count']))
As we can see, NBA, NBC Sports, NFL, and SSundee don't stand out in either their time to trending or their view count. Chandan Art Academy once again distinguishes itself, but again, most likely because of its single datapoint.
What happens if we examine NBA and NFL only within their own category of sports? Perhaps they appear so often because they are at the top of their category.
sports = unique[unique["categoryId"] == 17]
result = sports.groupby('channelTitle')['view_count'].median()
view_count_med = pd.merge(result, occ_count, on = "channelTitle")
plt.figure(figsize=(20,10))
plot = plt.scatter(np.array(view_count_med["view_count"]), np.array(view_count_med["occurences"]))
plt.ylabel("Occ")
plt.xlabel("Views")
plt.title("Median Views vs Occurrences in Trending By Channel for Sports")
for i, row in view_count_med.iterrows():
    if row['view_count'] > 10000000 or row['occurences'] > 80 or (row['view_count'] > 6000000 and row['occurences'] > 50):
        plt.annotate(row['channelTitle'], (row['view_count'], row['occurences']))
This is definitely better! We see that NFL and NBA clearly distinguish themselves from the main cluster of sports channels, which tend to have fewer occurrences in trending. There are some channels with an extremely large number of views that did not make it to trending all that often. The most consistent one in the sports category is, surprisingly, Dude Perfect, with both a good number of occurrences and a solid view count. This is strange because we didn't see Dude Perfect stand out in the overall graph of channel median view count versus channel occurrences in trending. However, we can start to make sense of NBA and NFL: they have consistently good view counts within their category, which makes them consistent occurrences on the overall trending list.
Let's take a look at whether there is a correlation between likes, view counts, and comment counts, and maybe even view count over time. The code for linear regression was adapted from one of our previous projects on linear regression, namely Steven's project 3. Our null hypothesis in each case is that there is no correlation.
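The idea of a channel being "at the top of its category" can be made concrete by ranking each channel's median views twice: once against all channels and once only against channels in the same category. The sketch below uses invented numbers; only `categoryId == 17` for sports comes from the code above, while the second category, the channel names other than NBA and NFL, and the column names `overall_pct` and `category_pct` are assumptions for illustration.

```python
import pandas as pd

# Invented per-video rows; 17 is the sports category, 10 is just a second toy category.
toy = pd.DataFrame({
    "channelTitle": ["NBA", "NFL", "SmallSports", "BigMusic", "MidMusic"],
    "categoryId":   [17, 17, 17, 10, 10],
    "view_count":   [900_000, 1_200_000, 100_000, 50_000_000, 20_000_000],
})

medians = (
    toy.groupby(["categoryId", "channelTitle"])["view_count"]
       .median()
       .reset_index()
)

# Percentile rank among all channels versus among channels in the same category.
medians["overall_pct"] = medians["view_count"].rank(pct=True)
medians["category_pct"] = medians.groupby("categoryId")["view_count"].rank(pct=True)
print(medians)
```

A channel with a modest `overall_pct` but a `category_pct` near 1 matches the pattern suggested for NBA and NFL: unremarkable next to music videos, but leading within sports.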
import sklearn
import sklearn.preprocessing
import sklearn.svm
import sklearn.model_selection
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
reg = LinearRegression().fit((np.array(unique["view_count"])).reshape(-1, 1), unique["likes"])
pred_line = reg.predict(np.array(unique["view_count"]).reshape(-1, 1))
ols1 = smf.ols(formula="likes ~ view_count ", data=unique).fit()
print(ols1.summary())
plt.figure(figsize=(20,10))
unique["1st_pred"] = pred_line
plot1 = plt.plot(np.array(unique["view_count"]), np.array(pred_line), color="red")
plot = sns.scatterplot(data=unique, x='view_count', y='likes')
plot.set(
ylabel='likes',
title='Viewcount Versus Likes')
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: likes R-squared: 0.727
Model: OLS Adj. R-squared: 0.727
Method: Least Squares F-statistic: 9.929e+04
Date: Fri, 12 May 2023 Prob (F-statistic): 0.00
Time: 22:32:16 Log-Likelihood: -5.0630e+05
No. Observations: 37251 AIC: 1.013e+06
Df Residuals: 37249 BIC: 1.013e+06
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 901.9685 1075.362 0.839 0.402 -1205.770 3009.707
view_count 0.0477 0.000 315.109 0.000 0.047 0.048
==============================================================================
Omnibus: 39416.070 Durbin-Watson: 1.280
Prob(Omnibus): 0.000 Jarque-Bera (JB): 28642151.353
Skew: 4.501 Prob(JB): 0.00
Kurtosis: 138.545 Cond. No. 7.63e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.63e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
This line fits the data well, with an R-squared of 0.727. The probability associated with the F-statistic (the p-value) is reported as 0.00, meaning it is too small to display, which is less than 0.05. This means that we can reject the null hypothesis that there is no correlation.
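For a regression with a single predictor, the slope, intercept, and R-squared in the summary above have simple closed forms, which can help when reading the table. The sketch below computes them by hand on made-up data; `simple_ols` is a hypothetical helper and is not how statsmodels computes its results internally.

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    slope = (x_c @ y_c) / (x_c @ x_c)           # cov(x, y) / var(x)
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    r_squared = 1.0 - (residuals @ residuals) / (y_c @ y_c)
    return slope, intercept, r_squared

# Invented data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept, r2 = simple_ols(x, y)
print(slope, intercept, r2)
```

With one predictor, R-squared equals the squared Pearson correlation between x and y, so an R-squared of 0.727 corresponds to a correlation of roughly 0.85 between views and likes.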
reg = LinearRegression().fit((np.array(unique["view_count"])).reshape(-1, 1), unique["comment_count"])
pred_line = reg.predict(np.array(unique["view_count"]).reshape(-1, 1))
ols1 = smf.ols(formula="comment_count ~ view_count ", data=unique).fit()
print(ols1.summary())
plt.figure(figsize=(20,10))
unique["1st_pred"] = pred_line
plot1 = plt.plot(np.array(unique["view_count"]), np.array(pred_line), color="red")
plot = sns.scatterplot(data=unique, x='view_count', y='comment_count')
plot.set(
ylabel='comments',
title='Viewcount Versus Comment Count')
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: comment_count R-squared: 0.353
Model: OLS Adj. R-squared: 0.353
Method: Least Squares F-statistic: 2.032e+04
Date: Fri, 12 May 2023 Prob (F-statistic): 0.00
Time: 22:32:17 Log-Likelihood: -4.6271e+05
No. Observations: 37251 AIC: 9.254e+05
Df Residuals: 37249 BIC: 9.254e+05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -7685.8342 333.680 -23.034 0.000 -8339.857 -7031.811
view_count 0.0067 4.69e-05 142.541 0.000 0.007 0.007
==============================================================================
Omnibus: 108202.043 Durbin-Watson: 1.864
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12057366999.812
Skew: 39.972 Prob(JB): 0.00
Kurtosis: 2789.020 Cond. No. 7.63e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.63e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
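Same story here with comment count: the p-value is again effectively 0, so we can safely reject the null hypothesis of no correlation. The fit is weaker, though, with an R-squared of 0.353 compared with 0.727 for likes.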
unique["min_date"] = unique["trending_date"].min()
unique["daydiff"] = (unique["trending_date"] - unique["min_date"]).dt.days
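# Fit view_count as a function of daydiff so the line matches the OLS formula and the plot axes below
reg = LinearRegression().fit(np.array(unique["daydiff"]).reshape(-1, 1), unique["view_count"])
pred_line = reg.predict(np.array(unique["daydiff"]).reshape(-1, 1))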
ols1 = smf.ols(formula="view_count ~ daydiff ", data=unique).fit()
print(ols1.summary())
plt.figure(figsize=(20,10))
unique["1st_pred"] = pred_line
plot1 = plt.plot(np.array(unique["daydiff"]), np.array(pred_line), color="red")
plot = sns.scatterplot(data=unique, x='daydiff', y='view_count')
plot.set(
xlabel = 'Days Since Beginning of Data',
ylabel='View Count',
title='Viewcount Versus Time')
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: view_count R-squared: 0.003
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 115.0
Date: Fri, 12 May 2023 Prob (F-statistic): 8.63e-27
Time: 22:32:17 Log-Likelihood: -6.3788e+05
No. Observations: 37251 AIC: 1.276e+06
Df Residuals: 37249 BIC: 1.276e+06
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.234e+06 7.02e+04 46.070 0.000 3.1e+06 3.37e+06
daydiff -1302.1094 121.425 -10.724 0.000 -1540.105 -1064.114
==============================================================================
Omnibus: 70249.421 Durbin-Watson: 1.649
Prob(Omnibus): 0.000 Jarque-Bera (JB): 197388673.277
Skew: 14.338 Prob(JB): 0.00
Kurtosis: 358.459 Cond. No. 1.18e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.18e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
This p-value is 8.63e-27, which is still far below 0.05, so we can reject the null hypothesis here. The slope of the line is about -1302, which says that videos trending later in the dataset have fewer views, by about 1,300 views per day. This is a small effect: the R-squared of 0.003 means that time explains almost none of the variation in view count. It is possibly connected to Covid, when many people may have watched more Youtube because they were sitting at home. As time went on and people ended their quarantine, Youtube watching could have trended downward.
Our project covered a lot of ground in analyzing youtube trends over time. To summarize, there is a strong positive correlation between views and likes, and a weaker but still statistically significant positive correlation between views and comments. There is also a very slight negative relationship between view counts and time. One possible explanation is that after quarantine ended in 2021/2022, people spent less time at home on leisure and returned to work and school, dropping overall views slightly.
Moreover, gaming videos that make trending often don't have an outrageous number of views. The trending videos that do extremely well (with high views and likes) tend to be mostly music videos, though some entertainment videos are included too.
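To get a feel for the size of this slope, it can be scaled to a year and compared with the fitted intercept. The arithmetic below uses only the coefficients printed in the OLS summary above; note that the intercept is shown there to four significant figures, so the results are approximate, and the choice of a 365-day window is ours.

```python
# Coefficients copied from the OLS summary above.
slope_per_day = -1302.1094   # change in view_count per day
intercept = 3.234e6          # fitted view_count at daydiff = 0
r_squared = 0.003

# Expected change in fitted view count over one year.
change_per_year = slope_per_day * 365

# The same change expressed relative to the fitted starting level.
relative_change = change_per_year / intercept

print(f"Change over one year: {change_per_year:,.0f} views")
print(f"Relative to intercept: {relative_change:.1%}")
print(f"Variance explained by time: {r_squared:.1%}")
```

So the fitted line drops by a noticeable fraction of its starting level over a year, yet time still explains well under one percent of the variation between individual videos.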
We also created a model that predicts the category of a video from the number of views, likes, and comments it has, although it seems to overfit somewhat. Our final model ended with approximately 30% accuracy, which can definitely be improved on but is much better than a random guess among 15 categories!
We also included bubble plots of the most viewed channels and the channels that occur most often on trending, to give a nicer visualization of which channels are viewed a lot. We provided some insight into what makes a channel appear often on trending and how that relates to the median views the channel gets for its trending videos (if you want to appear a lot, just try to top your category). We see little correlation between how many views a channel gets and how long after release its videos reach trending, and no clear correlation between the number of times a channel appears on trending and how many views it gets. All in all, despite some inconclusive results, we have provided some insight into the kaggle dataset for trending youtube videos and some useful visualizations on the subject, but there is still work to be done in drawing reasonable trendlines.
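When judging the 30% figure, two simple baselines are useful: a uniform random guess over the categories, and always predicting the most common category. The second is usually the harder one to beat when categories are imbalanced. This sketch uses invented labels, so it does not reproduce the class balance of the real dataset, and `baselines` is a hypothetical helper.

```python
from collections import Counter

def baselines(labels):
    """Return (uniform-guess accuracy, majority-class accuracy) for a label list."""
    counts = Counter(labels)
    uniform = 1.0 / len(counts)
    majority = max(counts.values()) / len(labels)
    return uniform, majority

# Invented category labels for illustration.
toy_labels = ["Gaming"] * 5 + ["Music"] * 3 + ["Sports"] * 2

uniform, majority = baselines(toy_labels)
print(f"Uniform guess: {uniform:.1%}, majority class: {majority:.1%}")

# With 15 equally likely categories, a uniform guess would be right about 6.7% of the time.
print(f"Uniform guess over 15 categories: {1 / 15:.1%}")
```

Running `baselines` on the real category column would show how much of the 30% accuracy is explained by category imbalance alone.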